ABSTRACT

This study empirically investigated the validity and
utility of the stratified adaptive computerized testing model
(stradaptive) developed by Weiss (1973). The model presents a tailored
testing strategy based on Binet IQ measurement theory and Lord's
(1972) modern test theory. Nationally normed School and College
Ability Test Verbal analogy items (SCAT-V) were used to construct an
item pool. Item difficulty and discrimination indices were rescaled
to normal ogive parameters on 249 items. Freshman volunteers at
Florida State University were randomly assigned to stradaptive or
conventional test groups. Both groups were tested via
cathode-ray-tube (CRT) terminals coupled to a Control Data
Corporation 6500 computer. The conventional subjects took a SCAT-V
test, while the stradaptive group took individually tailored tests
drawn from the same item pool. Results showed significantly higher
reliability for the stradaptive group, and equivalent validity
indices between stradaptive and conventional groups. Three
stradaptive testing strategies averaged 19.2, 26.5, and 31.5 items
per subject as compared with 48.4 items per conventional subject. A
50% reduction from conventional test length produced an equal
precision of measurement for stradaptive subjects. Item latency
comparisons showed the stradaptive group required significantly
longer per item than conventional group members. It is recommended
that time rather than number of items be used in future adaptive
research as a dependent variable. (Author/DEP)

EMPIRICAL INVESTIGATION OF THE STRADAPTIVE
TESTING MODEL FOR THE MEASUREMENT
OF HUMAN ABILITY



I. INTRODUCTION

This study investigated the validity and utility of the stratified adaptive, or "stradaptive," computerized
testing model proposed by Weiss and his colleagues in the Psychometric Methods Program, University of
Minnesota. The stradaptive model, theoretically, could provide a highly efficient means of assessing ability in
large-scale testing situations. Such a model could readily be implemented in military training or industrial
selection and classification situations.

The model is based upon the early work of Binet in the measurement of intelligence and upon Lord's
recent theoretical research in tailored testing. The model also utilizes modern latent trait theory and parameter
estimates as detailed in Lord and Novick (1968).

Weiss and his associates have reported the theoretical development of the stradaptive model (Weiss,
1973; DeWitt & Weiss, 1974; McBride & Weiss, 1974), including some examples of individual results. To date,
no full empirical studies of the model have been published. Weiss' exploratory evidence appears promising, but
leaves many questions unanswered. He suggests ten possible scoring methods, yet offers no evidence as to the
"best" method. The evaluation of scoring methods appropriate for tailored testing was one of the secondary
goals of this study. The primary goal of this study was the validation of the model itself.

Comparisons were made between the stradaptive group test scores and conventional group test scores,
both presented via a cathode-ray-tube mode of testing. Reliability and validity indices relative to the specific
subject sample used in this experiment were calculated.

The stradaptive model is very sensitive to the accuracy of item parameter estimates. In order to minimize
item parameter estimation errors, a large norming group is essential. Weiss and his colleagues were well aware of
this constraint, and have suggested specific procedures for establishing a reliable item pool for adaptive testing
(Larkin & Weiss, 1974). Nevertheless, the item pool used in their reported examples of stradaptive testing was
based on item parameter estimates calculated from norming groups of fewer than 200 subjects. In the current
study, items from the School & College Ability Test (SCAT) Series II Verbal Ability test (1966), which had
been nationally normed on a group of 3133 examinees comparable to the subjects in this experiment, were
used. These items should provide more trustworthy item parameters for use in the investigation of the model.

Determining the merits of a particular testing strategy has been a major problem in previous studies of
tailored testing. In any kind of tailored test, different examinees take different test items, thus prohibiting
many classical measurement indices of "goodness." Reliability assessment, particularly, has suffered due to
this problem. Traditional internal consistency calculations are not possible, and procedures such as Hoyt's
(1941) ANOVA reliability estimate apparently have unacceptable underlying assumptions (such as item
independence) when applied to tailored testing. One goal of this study was to determine an alternate form
reliability of the stradaptive test scores and to compare this index with a Hoyt-type reliability index. This
alternate form reliability index would provide a measure of the "goodness" of the stradaptive model as well as
of the ANOVA reliability estimation procedure.
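
For readers unfamiliar with the Hoyt-type index mentioned above, the following is a minimal sketch of an ANOVA reliability estimate for a conventional test in which every examinee answers every item. It is an illustration only, not the procedure used in this study: the function name and the small score matrix are hypothetical, and the calculation assumes a complete persons-by-items matrix, which is exactly the condition a tailored test does not satisfy.

```python
def hoyt_reliability(scores):
    """ANOVA reliability for a complete persons x items score matrix.

    scores: list of rows, one per examinee, each a list of item scores.
    Returns 1 - MS_residual / MS_persons.
    """
    n = len(scores)          # number of examinees
    k = len(scores[0])       # number of items
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_persons


# Hypothetical data: four examinees, three items scored 1 (right) or 0 (wrong).
example = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(hoyt_reliability(example), 3))
```

Because the item sum of squares is computed across all examinees, the estimate presupposes that persons are completely crossed with items; when each examinee sees a different item set, the matrix has no such structure.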

Validity, as well as reliability, must be adequate for a testing strategy to be "good." Eighty-seven of the
103 subjects in this experiment had previously taken the Florida 12th Grade Verbal test, composed of items
identical in form to the SCAT Series II Verbal items, and 12 subjects had 12th Grade Verbal score estimates
derived from American College Testing (ACT) or College Entrance Examination Board (CEEB) Verbal Test
scores. Both the Florida 12th Grade test and the SCAT tests were produced by Educational Testing Service
(ETS) and purportedly measured the same psychological dimension. Like the SCAT, the Florida 12th Grade
was normed on a large sample of subjects comparable to the subjects in this experiment. Thus, the 12th Grade
scores provided ideal external criteria scores for the stradaptive validity examination.

Item latency data was collected on all subjects in this experiment. Since each item was tailored to the 
examinee's ability level, it was hypothesized that examinees on a tailored test would take more time per item 
than on a conventional test. If this hypothesis were supported, the dimension of testing time must be 
considered in evaluating a tailored testing model. 

There is little doubt that the use of interactive computer testing will increase enormously in the coming
decade. Research in this area has just started to reveal some of the potential benefits of tailored testing to
institutions and individuals alike. Improved measurement accuracy and efficiency through the use of some
kind of adaptive, computer-based testing appear to be among these potential benefits. This study empirically
investigated one such proposal, the stradaptive testing model.



II. REVIEW OF RELATED RESEARCH

As the term implies, adaptive testing is defined as a method of test construction wherein the items
presented to a specific subject are selected iteratively dependent upon his previous responses, thus "adapting"
the test to the subject. Many terms have been used in the literature to refer to such an item selection strategy
(Table 1). In this paper, the comprehensive term "adaptive testing" will be used to include any or all of the
testing strategies listed in Table 1.

Adaptive testing had its beginnings in the early work of Binet on the measurement of intelligence. The
original Binet scale and the current version, the 1960 Stanford-Binet Scales (Terman & Merrill, 1960), utilized
an adaptive strategy to estimate a subject's IQ. The testing begins with the examiner selecting the first item to
be presented, based upon his judgment of the subject's ability level. Once testing starts, the examiner may
present the items in varying orders, based somewhat upon examinee responses. The basal and ceiling ages of the
subject are estimated in order to present items which are neither too easy nor too hard for the subject. This is
done through the construction of groups of items whose difficulties are centered around "mental ages," that is,
"peaked" tests are formed in which about 50% of the norming group of that chronological age responded with
a correct answer to those items. Thus, the Stanford-Binet can be looked upon as a series of mini-tests designed
to provide an efficient measure of the ability of each subject.

Theoretically, individual testing, as in the case of the Binet, should provide more accurate measurement
than group testing. Nevertheless, individual testing strategies do have weaknesses. Obviously, the major
problem is the cost of administration. These tests must be administered by a highly-trained examiner working
on a one-to-one basis with the subject. Such expenditure may be warranted on an individual case basis when
subjects are referred through external evaluations, but is clearly impractical on any large scale.

In addition to the cost deterrent, individual testing is plagued by several more technical problems. Weiss
and Betz (1973) cite numerous research studies suggesting differential examiner effects. Differential scoring
effects were cited, as well as interaction effects between the personality and social attributes of both examiners
and examinees. Thus, the theoretical gains in measurement efficiency attributed to an individual testing
strategy may well be offset by the added variance in test scores due to uncontrolled factors in the testing
process.

The paper and pencil mode of item presentation is, of course, the most common testing strategy. An
enormous volume of theoretical and empirical work has been done under the banner of classical measurement
theory. This field has made giant strides through the reduction of measurement error and thus, the improved
utility of the scales. Many practical situations demand that all subjects must take the same collection of test
items, with identical time limits, via the paper and pencil mode of presentation. Nevertheless, it must be
realized that certain limitations are inherent in conventional test administration.

Careful training and standardization of group test administrators is intended to control for many of the
inadequacies of individual test administration. Research evidence exists which shows that uncontrolled
examiner variables are still present. Weiss and Betz (1973) extensively discussed five major areas in which
unwanted variance enters the group measurement process:

1. Administrator variables, such as sex or race;

2. Answer sheet effects, in which answer sheet formats differentially affect scores;

3. Item arrangement effects within a test;

4. Timing and time limit effects; and

5. An effect resulting from the standardized set of items which is administered to all examinees.

TABLE 1

Alternate Terminology Used to Describe Adaptive
Testing Strategies and Their References

TESTING STRATEGY       REFERENCES

ADAPTIVE               Kappauf, 1969; Wood, 1971; Wood, 197?; Betz & Weiss, 1973;
                       Weiss, 1973; Weiss & Betz, 1973; DeWitt & Weiss, 1974;
                       Larkin & Weiss, 1974; McBride & Weiss, 1974

BAYESIAN               Novick, 1969; Owen, 1969; Urry, 1970; Urry, 1971

BRANCHING              Waters, 1964; Bayroff, 1969; Waters, 1970; Bayroff, 1971;
                       Waters & Bayroff, 1971; Bryson, 1971

FLEXILEVEL             Lord, 1971b, d; Olivier, 1973; Olivier, 1974

MULTI-LEVEL            Angoff & Huddleston, 1958

PROGRAMMED             Bayroff, 1964; Hubbard, 1966; Bayroff & Seeley, 1967;
                       Cleary, Linn & Rock, 1968a, b; Linn, Rock & Cleary, 1969

RESPONSE-CONTINGENT    Wood, 1973

SEQUENTIAL             Cowden, 1946; Wald, 1946; Moonan, 1950; Krathwohl & Huyser, 1956;
                       Bayroff, Thomas & Anderson, 1960; Paterson, 1962;
                       Seeley, Morton & Anderson, 1962; Cronbach & Gleser, 1965;
                       Hansen, 1969; Kappauf, 1969; Linn, Rock & Cleary, 1970;
                       Wood, 1971; Wood, 1972

TAILORED               Lord, 1968; Owen, 1969; Owen, 1970; Stocking, 1969; Wood, 1969;
                       Green, 1970; Holtzman, 1970; Lord, 1970; Lord, 1971a, c, e;
                       Kalisch, 1974

Stanley (1971) suggests that the effective length of a test is considerably shorter than the actual length of
the test for a specific examinee, since many items are too easy and many are too hard. The easy items are a
waste of time and testing costs, while the too hard items encourage guessing and add all the measurement
problems associated with this source of extraneous variance. Thus, a standard set of items, peaked at the mean
of the norming group, is only truly optimal for a subject of mean ability on the dimension being measured.
Consistent with this, information theory research has shown that a test peaked at a difficulty value of .5
provides optimum measurement (maximizes internal consistency) for examinees of the subject's ability level
(Hick, 1951; Lord, 1970, 1971, 1971a, 1971d, 1971e).

In addition to the previously mentioned problem wherein the standard set of items contributes to
guessing, another serious problem arises. Many research studies have shown that guessing is not a consistent
trait throughout the ability continuum (Lord, 1957, 1959; Baker, 1964; Nunnally, 1967; Boldt, 1968). Low
ability subjects guess more often than high ability subjects, creating differential measurement accuracy along
the ability continuum.

The literature implies that both conventional paper and pencil group tests, and traditional, individually 
administered tests are not always optimally suited to large-scale ability testing. Adaptive testing appears to 
offer a feasible and practical alternative to these two modes of test administration. It involves selecting a test 
item for presentation based upon the subject's response to the previous item or items. 

The principle underlying the Binet testing strategy (i.e., that the difficulty of the test items selected for
a given subject should be peaked around the subject's ability level, not the total group's ability level) is also the
basis of the stradaptive model.

Considerable research has been done in the last twenty years to find a method of testing which will
accomplish this goal. Figure 1 depicts a three dimensional (3x2x2) model of adaptive testing research
strategies categorized according to (1) type of research (empirical, simulated or theoretical); (2) whether the
number of items (or stages) is fixed for all examinees; and (3) whether the item difficulty step-size between
stages is fixed or variable throughout the test.

Table 2 lists the particular cells of Figure 1 with research studies reviewed noted in the appropriate cells.
It is hoped that Table 2 will provide a helpful reference to the literature for future researchers concerned with
adaptive testing. The balance of this literature review will refer to Table 2 and discuss research results
cell-by-cell.

Any classification system such as that used in Table 2 and Figure 1 requires many arbitrary categorization
decisions. For the purposes of this paper, an empirical study was defined as one in which "real live" subjects
provided the source of the data in a research study. Studies in which existing data banks were reanalyzed "as if"
the subjects had proceeded through the test according to some other strategy than they actually did were
classified as simulated studies. Computer-generated monte carlo studies were included in this category. The
theoretical category included both mathematical and non-mathematical discussions of adaptive testing
strategies and provided somewhat of a catchall for research studies that did not seem to fit the other two
classifications. Some studies were multiple-classified if comparisons were made between adaptive strategies of
more than one type.

The dimension "step sizes" similarly required some arbitrary assignments. Two-stage testing, for
example, is not always structured according to fixed step sizes, though theoretically, it could be. Nevertheless,
this adaptive strategy was considered to be fixed step size rather than a "true" variable step size strategy, as
is the case in the Robbins-Monro technique. A study was assigned to the fixed number of stages dimension if all
examinees in a comparison group took the same number of items, regardless of the number of stages involved in
the branching strategy.

As shown by the left half of Table 2, about two thirds of the adaptive testing papers reviewed were
concerned with a fixed number of stages per test. This concentration is understandable. First, having all
examinees take the same number of items simplifies statistical analysis immensely, particularly when
estimating internal consistency reliability. Stanley (1971) has shown a method for determining this index
despite unequal numbers of items per subject, but his paper post-dated much of the reported research in
adaptive testing. Secondly, the training of the majority of psychometricians has been under classical
measurement theory in which all subjects are completely crossed with all items. Finally, testing large numbers
of subjects with tests of different lengths probably had to await the development of computer-based testing
technology. This last point is vividly supported by the fact that 13 of the 15 variable number of stages studies
reviewed have been published since 1968.

The second dimension in Table 2, "step sizes," like "number of stages," was predominantly concentrated
in one classification. Two thirds of the studies reviewed analyzed only constant step sizes. The "constant
step-size" categorization included both pyramidal and multiple stage tests. In pyramidal testing, items are
grouped by difficulties over a set number of stages, while multiple stage tests include routing and measurement
stages with a set number of items per stage and a given number of stages for all subjects.

Figure 1. Adaptive testing research strategies. (Axis label: Type of Research Study.)

The third dimension of Table 2, "type of research study," shows a fairly even distribution among
empirical, simulated and theoretical work. One would expect the theoretical papers to precede the empirical
model validation studies. However, the three levels of this dimension have been published concurrently
throughout the last fifteen years or so.

The balance of this chapter will consider each of the three dimensions of Figure 1 and briefly summarize 
consistent results within each cell. 

Fixed Number of Stages

Constant Step Sizes 

Theoretical studies. Lord's six papers (1970, 1971a, b, c, d, e) investigated the measurement effectiveness
of both fixed and variable step size strategies within several varieties of fixed number of stages. His work
utilizes item characteristic curve theory (Lord, 1972) under a specific set of assumptions which will be
discussed in Chapter III of this paper. Lord's theoretical analysis of two stage testing (1971c) varied the
number of items presented to each subject in the routing and measurement tests, the distributions of items
between the two stages, and whether guessing was assumed to be present or not. His results were presented in
the form of graphic comparisons between the several adaptive testing strategies and a 60 item peaked
conventional test, using information functions to evaluate the amount of information yielded (Figure 2).

Figure 2. Efficiency of measurement as a function of ability level (after Lord, 1970; 1971a, b, c).
The figure plots measurement efficiency against ability (in z-score units, from -3.0 to +3.0, with
0.0 marked as average) for Adaptive Test 1, Adaptive Test 2, and a peaked test.



He concluded that the best of the two stage strategies provided almost as effective measurement as the
peaked conventional test at the mean of the ability continuum, with relatively greater improvement as a
subject's ability level departed from the mean ability of the group. He found that guessing decreased the
effectiveness of measurement for low-ability subjects, but affected high ability estimates much less.

Lord's theoretical development (1971b) and evaluation (1971d) of flexilevel testing was an attempt to
implement the adaptive testing concepts under a paper and pencil mode of test presentation. Lord's analysis
compared a 60-item flexilevel test with a 60-item conventional test, both tests with assumed equal item
discriminations, and a third test peaked at two points along the ability continuum. He found the flexilevel test
superior in information provided throughout the range of abilities. As with the two stage testing, the
conventional peaked test measured more effectively than the adaptive test in the center of the distribution of
scores, but the flexilevel ability estimate was more accurate for at least 30% of the population. Unfortunately,
the only empirical study to date of flexilevel testing (Olivier, 1974) found reduced efficiency of measurement
throughout the ability continuum.

Simulated studies. Five research studies on simulated data were reviewed. These concentrated upon a
fixed number of stages and constant step sizes. Three of these studies were made by Cleary, Linn, and Rock
(1968a, 1968b, 1970) using 190 items from SCAT and STEP item banks which were then reanalyzed as if the
subjects had proceeded through the item pool in an adaptive fashion. They compared seven strategies of
two-stage adaptive testing with 10, 20, 30, 40, and 50 item conventional tests from the same pool. They found
that one of the adaptive procedures correlated highest with total score, followed by the conventional tests and
then the rest of the adaptive tests. The authors estimated an improvement of about 35% over the best short
conventional test on a comparable number of items by the best adaptive strategy. Validity coefficients in every
case but one showed higher correlations with external criteria for adaptive tests than the conventional tests of
equal length.

Waters and Bayroff (1971) used hypothetical 5, 10, and 15 item conventional tests for comparison with
5 and 10 item branching tests, varying item difficulty ranges and the item-biserial index. Their study showed
that adaptive tests yielded higher validities than any of the conventional tests for tests made up of items with a
biserial index of .60 or .80, and equal validity coefficients at a .40 biserial.

The simulated results of the Cleary group and the Bayroff group were very similar and comparable to the
empirical results reported in the following section of this paper.

Empirical studies. The eight empirical research studies reviewed by the author investigated adaptive
testing strategies having a fixed number of items or stages and constant step sizes. Two major varieties of
adaptive testing have been empirically evaluated: two-stage testing and multi-stage testing. Typically, in the
former strategy, a routing test with a wide range of difficulties is used to assign subjects to one of several
measurement tests with item difficulties peaked around specific points along the hypothesized ability
continuum.

Figure 3 (from Bayroff, 1964) depicts an 8-item routing test coupled with a 6-item measurement test.
A subject's score was determined as a direct function of the number of correct responses or as a function
of the item difficulty and discrimination of those items answered correctly.

In the majority of multi-stage adaptive testing research, a pyramidal model similar to that depicted in
Figure 4 has been followed. In the example shown in Figure 4, an 8-stage strategy was utilized. All subjects
received 8 items, beginning with Item 1, which was generally the item of median difficulty. The change in item
difficulty between stages (step size) was fixed (.05 in the example).




Figure 4. Example of 8-step pyramidal adaptive test (from Bayroff, 1964). The bottom row of the
figure is numbered 1 through 16 and labeled from LOW to HIGH; p-values are shown with decimals omitted.



A subject's score was based upon either the average difficulty of items answered correctly or upon the
final item in the pyramid as shown in Figure 4. In this example a score ranging from 1 to 16 was assigned to the
examinee.
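
The pyramidal procedure just described can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name is hypothetical, the starting p-value of .50 is assumed for the median item, and the rule mapping the final item and final response onto a 1 to 16 score is one plausible reading of the figure, not a rule confirmed by Bayroff (1964).

```python
def run_pyramid(responses, start_p=50, step=5):
    """Trace one examinee through a fixed-step pyramidal test.

    responses: list of 1 (correct) or 0 (incorrect), one per stage.
    p-values are kept as integer hundredths (50 means p = .50).
    Returns the p-values of the items presented and an assumed score.
    """
    p_value = start_p
    presented = []
    for stage, u in enumerate(responses):
        presented.append(p_value)
        if stage < len(responses) - 1:
            # A correct answer leads to a harder item (lower p-value),
            # an incorrect answer to an easier item (higher p-value).
            p_value += -step if u == 1 else step
    # Assumed scoring: the final item reached (one of n positions) and the
    # response to it jointly identify one of 2n outcomes, numbered 1..2n.
    final_position = sum(responses[:-1])
    score = 2 * final_position + responses[-1] + 1
    return presented, score


# Hypothetical examinee who alternates correct and incorrect answers.
items, score = run_pyramid([1, 0, 1, 0, 1, 0, 1, 0])
print(items, score)
```

With eight stages the final stage contains eight possible items, so pairing the final item with a right or wrong response yields the sixteen score values; any other monotone mapping would serve the same illustrative purpose.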

The eight empirical studies in this cell of Table 2 reached general consensus in research results. All but
Olivier (1974) and Wood (1969) found increases in the precision of measurement utilizing adaptive testing.
Olivier attributed his result to unaccounted variance in the test scores possibly being caused by unfamiliarity of
the subjects with the flexilevel testing format. Wood's research utilized a paper and pencil branching technique
which, like the flexilevel procedure, likely led to a large number of subjects branching incorrectly.

Of the eight studies in this cell, the correlations between the short adaptive test scores and the longer
conventional scores were in the JS to .86 range, with the exception of Wood's pooled results showing only a .51
relationship. As a group, these studies tended to recommend that further research in adaptive testing be centered in
mechanical or computer-based modes of presentation rather than the traditional paper and pencil method. The
five papers utilizing such equipment all suggested further research in the area of adaptive testing.

Lord presents a discussion of tailored testing theory in general in Holtzman (1970). He provided a brief
description of item characteristic curve theory, information function theory, several strategies of step size
variation, several suggested scoring methods for tailored testing and varied number of items. He included in the
final section of the paper the following caveat:

If, for example, 500 items are available for tailored testing, better measurement will often be obtained by
selecting, for example, the n = 60 most discriminating items (highest a_i) and administering these as a conventional
test, rather than using all 500 in a tailored testing procedure. This may actually prove to be a fatal objection to
any formula of tailored testing. (Emphasis his.)

It is the judgment of the author of this paper that the Lord (1970) paper should be essential reading for
any researcher interested in adaptive testing. Although the majority of adaptive testing research reported to
date appears promising, Lord's warning should be kept in mind when evaluating the effectiveness of any
adaptive testing strategy.

Variable Step Sizes

Theoretical studies. The majority of the theoretical research into a fixed number of stages and variable
step sizes has been under the Robbins-Monro branching rule. Stocking (1969) and Lord (1970, 1971c, 1971d)
have analyzed the Robbins-Monro technique in comparison with the more conventional up-and-down method
described by Lord (1970). Essentially the Robbins-Monro, or so-called shrinking step size method, presents an
item of median difficulty ($b_1$) to begin the test. Once item $i$, of difficulty $b_i$, has been answered, the
difficulty of the next item, $b_{i+1}$, is selected thusly:

$$b_{i+1} = b_i + d_i\,(u_i - \alpha) \qquad (1)$$

(From Lord, 1971c)

where: $d_1, d_2, d_3, \ldots$ is a decreasing sequence of positive numbers chosen in advance of testing,

$b_i$ = difficulty of the $i$th item,

$u_i = 1$ if item $i$ is answered correctly, and

$u_i = 0$ otherwise.

$d_i$ and $\alpha$ are positive numbers to be chosen prior to testing in order to produce good measurement
properties on the final test scores.

The fixed step size methods discussed earlier determine the difficulty of the $(i + 1)$th item by a constant
increment independent of $i$:

$$b_{i+1} = b_i + 2d\,(u_i - \tfrac{1}{2}) \qquad (2)$$
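
The difference between the two branching rules is easiest to see by tracing a short response pattern. The sketch below is illustrative only: the function names, the response pattern, and the step sequence are hypothetical, and it assumes that the offset constant in both rules is 1/2, so that a correct answer raises the next difficulty and an incorrect answer lowers it by the same amount.

```python
def next_difficulty_shrinking(b_i, u_i, d_i, alpha=0.5):
    """Shrinking step size rule: b_{i+1} = b_i + d_i * (u_i - alpha)."""
    return b_i + d_i * (u_i - alpha)


def next_difficulty_fixed(b_i, u_i, d):
    """Fixed step size (up-and-down) rule: b_{i+1} = b_i + 2d * (u_i - 1/2)."""
    return b_i + 2.0 * d * (u_i - 0.5)


def trace_shrinking(responses, steps, b_1=0.0, alpha=0.5):
    """Difficulties presented under the shrinking rule, including the next item."""
    difficulties = [b_1]
    for u_i, d_i in zip(responses, steps):
        difficulties.append(next_difficulty_shrinking(difficulties[-1], u_i, d_i, alpha))
    return difficulties


def trace_fixed(responses, d, b_1=0.0):
    """Difficulties presented under the fixed rule, including the next item."""
    difficulties = [b_1]
    for u_i in responses:
        difficulties.append(next_difficulty_fixed(difficulties[-1], u_i, d))
    return difficulties


# Hypothetical pattern: right, right, wrong; step sequence halves each stage.
print(trace_shrinking([1, 1, 0], steps=[1.0, 0.5, 0.25]))
print(trace_fixed([1, 1, 0], d=0.5))
```

Under the shrinking rule each successive move is smaller, so the sequence of difficulties settles toward a point, whereas under the fixed rule the difficulty continues to move by the same amount after every response.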

Lord (1970, 1971c, 1971d) compared these two step size strategies and found that the shrinking step
sizes provided better measurement than several varieties of up-and-down methods. A major deterrent to the
shrinking step sizes was reported. The up-and-down method requires an item pool of only $n(n + 1)/2$ items (for
a 15 item test, for example, a 120-item pool is necessary), which is reasonable in most large scale testing
situations. To use the Robbins-Monro strategy, $2^n - 1$ items should be available (32,767 items for a 15 item
test), literally an impossibility in any situation. Since both empirical and theoretical research (Wood, 1971;
Novick, 1969) have shown with a remarkable degree of consistency that adaptive testing is most effective
between 15 and 20 items per test, the shrinking step size methods as now conceptualized are not feasible in the
real world despite their theoretical superiority. In reality, Lord found this superiority to be relatively small and
recommended use of the fixed step-size procedures rather than a Robbins-Monro procedure whenever the number of
items exceeds six.
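
The item pool requirements quoted above follow from simple counting, and the arithmetic can be checked directly. The function names below are hypothetical; the formulas are the two pool sizes stated in the text for an n-item test.

```python
def up_and_down_pool_size(n):
    """Items needed for an n-item fixed step (up-and-down) test: n(n + 1) / 2."""
    return n * (n + 1) // 2


def shrinking_step_pool_size(n):
    """Items needed for an n-item shrinking step size test: 2^n - 1."""
    return 2 ** n - 1


for n in (6, 15, 20):
    print(n, up_and_down_pool_size(n), shrinking_step_pool_size(n))
```

The fixed step pool grows with the square of test length, while the shrinking step pool doubles with every added item, which is why the latter becomes impractical at the 15 to 20 item lengths discussed here.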

Lord (1970) and Stocking (1969) also investigated the persistent problem of how to score adaptive tests.
Since different subjects may take different collections of test items in different orders, the conventional
practice of rights-only or rights-corrected-for-guessing scoring is clearly inappropriate. Lord's theoretical research
showed that scores based upon the average difficulty of items answered correctly were superior to scoring
methods based upon the difficulty of the final item passed or of the next item that the examinee would have
taken. Conceptually, the latter two methods appear sound, since the estimate of the subject's true ability
should improve as more items near the subject's .50 probability level are presented. If the subject's true score is
far from $b_1$, the author would expect the early items faced by the subject to adversely affect average difficulty
scoring methods. Certainly this area of adaptive testing remains to be empirically evaluated beyond Lord and
Stocking's hypothetical investigations.
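
The three scoring methods compared above can be stated compactly. The sketch below is a hedged illustration rather than Lord's or Stocking's procedure: the function names and the example response record are hypothetical, and the "next item" score assumes a fixed step size branching rule.

```python
def average_difficulty_correct(difficulties, responses):
    """Mean difficulty of the items answered correctly (None if none were)."""
    passed = [b for b, u in zip(difficulties, responses) if u == 1]
    return sum(passed) / len(passed) if passed else None


def final_item_passed(difficulties, responses):
    """Difficulty of the last item answered correctly (None if none were)."""
    passed = [b for b, u in zip(difficulties, responses) if u == 1]
    return passed[-1] if passed else None


def next_item_difficulty(difficulties, responses, step):
    """Difficulty of the item the examinee would have taken next,
    assuming a fixed step size up-and-down rule."""
    last_b, last_u = difficulties[-1], responses[-1]
    return last_b + step if last_u == 1 else last_b - step


# Hypothetical four-item record: difficulties presented and right/wrong responses.
difficulties = [0.0, 0.5, 1.0, 0.5]
responses = [1, 1, 0, 1]
print(average_difficulty_correct(difficulties, responses))
print(final_item_passed(difficulties, responses))
print(next_item_difficulty(difficulties, responses, step=0.5))
```

The average difficulty score uses every item passed, including the early items far from the examinee's level, while the other two scores depend only on where the examinee ends up, which is the trade-off the paragraph above describes.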

Simulated studies. Paterson's (1962) monte carlo study evolved from the sequential item test (SIT) of
Krathwohl and Huyser (1956). A six-item conventional test and a six-item pyramidal test were created with
1500 "examinee" scores generated at 15 different ability levels (100 at each level). Unlike all of the other studies
of adaptive testing reviewed, Paterson selected items based upon biserial correlation rather than exclusively by
item difficulty. He ordered the items in the pool by difficulty and by $r_{bis}$ within difficulty levels. Step size was
thus a function of item discrimination, approximating a shrinking step size model since larger steps were taken
for early items and shorter steps for later items. He scored his tests based upon the final difficulty method.
His results showed the adaptive test to better reflect non-normal ability distributions and to better measure
examinees with abilities in the extremes of the distribution. As with Lord, Paterson found measurement
efficiency slightly inferior for the adaptive strategy near the mean of the score distribution. He recommended
that adaptive item pools have a flatter distribution of item difficulties than the conventional test.

The only other simulated study of fixed number of stages with variable step size adaptive testing was
done by Bryson (1972). She compared two 5- and 10-item adaptive measures with two 5-item conventional
tests, with a validity coefficient based upon a 100 item parent test serving as criterion. Her results did not favor
the adaptive procedure; however, several methodological errors involving branching and scoring, and the fact that
the control group was tested via paper and pencil while the experimental group used a cathode ray tube (CRT),
suggest the discounting of her results.

Bryson further compared her empirical results described above with two groups of test scores of 100
recruits which were rescored as if they had been taken sequentially, as Cleary, Linn, and Rock (1968a, 1968b)
had done earlier. The correlation of these four group scores to the parent test yielded one group with higher
adaptive correlation and one with higher conventional correlation. Such a result leads one to question the
procedure of using "real data" from data banks for simulations of adaptive test results. Apparently, an
interaction effect exists between item order, item selection and/or examinee response which invalidates this
type of simulation design.

Empirical studies. The aforementioned paper of Bryson (1971) and two studies by Bayroff's associates
(Seeley, Morton, and Anderson, 1962; Bayroff, Thomas, & Anderson, 1960) comprise the only reported
empirical studies of adaptive testing with variable step sizes. The Bayroff studies incorporated one unique twist
in adaptive research in that branching from the first item was based upon not only whether the subject's
response was correct or not, but also upon the incorrect responses. The attempt at utilizing the "partial
knowledge" information available to discriminate between examinee ability levels has been extensively
investigated (review by Stanley & Wang, 1970) on an entire test basis with increases in test reliabilities and
decreases in test validation generally reported. The major problem appears to be finding enough "good" items,
and with "good" distractors, to comprise a test. Under the Bayroff group's strategy, only one or two such items
would be required, which seems much more feasible. Such an approach seems to be worthwhile for further
investigation.

Results from the Bayroff studies showed a .63 correlation for a 6-item adaptive test with a parent test
while a 25-item conventional test correlated significantly higher with a parent test. The authors noted that the
distribution of item difficulties was badly skewed to the left with a resultant skewed score distribution. In
addition, the adaptive tests involved longer construction, administration and scoring time and resulted in more
unusable answer sheets than the conventional tests. These results are consistent with the Wood (1969) and
Olivier (1974) results using paper and pencil adaptive tests. Apparently, a mechanized mode of presentation
should be used for any adaptive testing to avoid examinee branching errors.



Variable Number of Stages 

Research studies on adaptive testing involving variable numbers of stages fall under the category of
decision theory. In these studies, testing was terminated when a preset criterion was reached. Commencing
with the work of Wald (1946) and the Statistical Research Group (SRG) and carried on by Cronbach and
Gleser (1965), sequential analysis techniques entailed presenting an item or block of items to a subject, after
which a decision is made to (a) assign the subject to a "passing" group; (b) assign the subject to a "failing"
group; or (c) continue testing. All of the research done within the variable number of stages level essentially
follows the sequential analysis model in determining a stopping rule for testing.
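
The three-way decision described above (pass, fail, or continue testing) can be illustrated with a minimal sketch of a sequential probability ratio test. This is only an illustration under stated assumptions, not the procedure used in any of the studies reviewed: the function name, the two hypothesized proportions correct (p_fail, p_pass) and the error rates (alpha, beta) are hypothetical values chosen for the example.

```python
import math

def sprt_decision(responses, p_fail=0.5, p_pass=0.8, alpha=0.05, beta=0.05):
    """Sequential pass/fail/continue decision over scored item responses.

    responses: iterable of 1 (correct) or 0 (incorrect), in order given.
    p_fail, p_pass: assumed proportions correct for "failing" and
        "passing" examinees (hypothetical values).
    alpha, beta: tolerated error rates for the two wrong decisions.
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross it: assign to "passing"
    lower = math.log(beta / (1 - alpha))   # cross it: assign to "failing"
    llr = 0.0                              # running log likelihood ratio
    n = 0
    for n, correct in enumerate(responses, start=1):
        if correct:
            llr += math.log(p_pass / p_fail)
        else:
            llr += math.log((1 - p_pass) / (1 - p_fail))
        if llr >= upper:
            return "pass", n
        if llr <= lower:
            return "fail", n
    return "continue", n                   # no decision yet: keep testing
```

Under these assumed values, a run of correct answers ends testing sooner for some examinees than others, which is the sense in which the number of stages is variable.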

Conceptually, varying the number of items presented between subjects makes sense. Setting a particular
number of items for all subjects fails to account for individual differences between subjects and certainly must
be wasteful for a percentage of the examinees. The catch, of course, is in determining when to cease testing for
each subject and handling the problems which arise when examinees do not take an equal number of items.

Constant Step Size 

Theoretical studies. About a decade after the previously cited work of Wald and the SRG, Cronbach and
Gleser's (1965) book, Psychological Tests and Personnel Decisions, presented a complete theoretical
exposition of efficient testing procedures. They introduced the concept of cost effectiveness and concluded
that, theoretically, testing efficiency will be maximized by completely adapting the test to the individual
testee. Green (1970) reiterated the cost effective point in responding to Lord's (1970) caveat concerning
adaptive testing. Kappauf (1969) described an application of the up-and-down method of branching using a
sequential analytic stopping rule for computer-based psychological testing, although no results were reported.
No further theoretical research was found until Weiss (1973) presented the model he termed "stradaptive
testing," produced under a research grant from the U.S. Navy to investigate computer-based adaptive testing
for possible Navy implementation on a large scale. Weiss and his associates are in the process of comparing
two-stage, Bayesian, pyramidal, flexilevel and stradaptive testing strategies with one another and with
conventional testing. DeWitt and Weiss (1974) published a description of an elaborate computer software
system for making these comparisons and McBride and Weiss (1974) produced a description of the mechanics
of creating an item pool for adaptive testing research. Since Weiss' stradaptive model is the target of this present
study, the description of the model will be held until Section III of this paper when a complete definition of the
elements of the model will be made.

Simulated studies. The author found only one simulated study involving constant step sizes and a
variable number of stages. Linn, Rock, and Cleary (1970) reanalyzed 1967 College Level Examination Program
(CLEP) data from English composition, mathematics and natural sciences examinations. They simulated two
adaptive testing strategies, one in which three CLEP tests were analyzed separately and the other in which the
mathematics test score was used in the decision process for the English and science tests. Essentially, Linn et
al. followed the sequential analytic procedures suggested by Wald, although the specific model was developed
by Armitage (1950). They also scored short conventional tests of the first 5, 10, 15, 20, 25, 30, 35, 40, 45, 50,
55, and 60 items for comparisons with the adaptive tests.

The results of Linn et al. showed substantial improvement in assignment of subjects to one of two groups for
dichotomous decision making. They estimated that the short conventional tests required approximately twice
as many items to achieve a comparable level of accuracy as that achieved by the adaptive tests. To the author's
knowledge, no empirical study has been conducted to verify this impressive result. Such a study is warranted,
since other "real data" simulation results have not replicated empirically.

Empirical studies. No published research on adaptive testing with constant step sizes and a variable
number of stages was found by the author with the exception of examples of stradaptive records reported in
Weiss (1973). Weiss is presently investigating this area and has advised the present author (personal
communication) of some aspects of his results. Test-retest reliabilities on ten different scoring methods
have been in the range of .72 to .93 for a method which branches the subject to an easier item whenever he
either misses the previous item or responds with a question mark. Weiss' alternate stradaptive testing model
(which is the model used in the present study) presents another item of equal difficulty after a question mark is
entered in response to an item. His resulting test-retest reliabilities using this model have been consistently
about .10 lower than those from the other model.

Two empirical studies have been made (Cowden, 1946; Moonan, 1950) which verified the sequential
analysis application in testing. However, the tests used were presented to the subjects in a fixed order, with
only the number of items being presented being varied. This strategy is not adaptive testing, per se. Thus, these
two studies have not been included in Table 1. The favorable results do provide evidence that an increase in
testing efficiency is possible by adapting the number of items on a test to the individual subject.

Variable Step Sizes

Models in which both the number of stages and step sizes are variable generally fall under the heading of
Bayesian testing. All reported work in the area has been published during the last five years. Computer
implementation seems essential since the selection of each item for a given examinee takes into account all
previous responses. A criterion is established such as to minimize measurement error by providing an estimate
of the subject's ability. This estimate is a weighted average of the norming group's performance on an item and
the subject's performance on the items taken up to that item.

Theoretical studies. Two models have been suggested for implementing the Bayesian formulas in
adaptive testing. Novick (1969) and Owen (1969, 1970) have produced radically different models which
appear to be conceptually appealing. The complexity of the Bayesian models prohibits lengthy description in
this paper. However, some of the results have direct application to more conventional adaptive testing. Novick
(1969) anticipated Bayesian testing to be particularly advantageous for tests of 15 to 20 items in length. This
result has been supported in the fixed number of stages empirical studies reviewed earlier and also in Wood's
(1971) empirical study of Owen's model. This consistency of results in the adaptive testing literature provides
strong evidence of the potential savings in the number of items required in adaptive testing.

Simulated studies. Urry (1970, 1971) has reported two monte carlo studies of a model based upon a
logistic test model. Like the Bayesian models, Urry's strategy chose items in order to minimize the standard
error of the estimate of the subject's ability. Unlike Bayesian testing, however, Urry's model utilizes maximum
likelihood estimates calculated after each item to estimate ability.

Urry varied item-ability biserial correlations, number of items, difficulties, the guessing parameter and
the shape of the distribution of item difficulties to generate 36 item structures. His criterion was validity in the
prediction of the scores of 100 hypothetical "subjects" of known ability levels.

His results showed his adaptive tests to be increasingly effective when item discrimination increased,
particularly when a broad range of item difficulties was used. In such a situation, he found a 10-item adaptive
test to be as effective as a 30-item conventional test. He also suggested that adaptive testing not be used when
the probability of guessing an item correctly is .50, as in a true-false test.

Urry's results also indicated that when high item discrimination indices were coupled with a rectangular
distribution of difficulties, a 10-item adaptive test produced as high a correlation between known and
estimated ability as a 100-item peaked test. When he analyzed the results with item discrimination set at .45
such as Lord (1970, 1971) used, his results confirmed Lord's less dramatic conclusions. He concluded that
adaptive testing be used when item-ability biserial correlations are .65 or larger. Unfortunately, a large pool of
items above this criterion would be most unusual in the typical ability testing situation. If such a minimum
standard is necessary for adaptive testing to be empirically effective, this fact alone could toll the death knell
for this testing strategy.

Urry's second study (1971) used the same model as his earlier dissertation. He generated three item
banks and fit the data to the model. He determined that Bayesian testing of the Verbal Scholastic Aptitude
Test (VSAT) could save 65% of testing time for the average examinee.

Kalisch (1974) used the beta distribution and conditional item difficulties to predict subject responses
on items beyond those he actually took. A sequential decision rule was used to determine when to cease testing
based on an expected loss function to the subject between the three possible decisions (item response would be
correct, no assumption, or response would be incorrect). Results were reported as favorable to future research
into this model.

Empirical studies. Four empirical studies of fully adaptive testing have been reported. Wood (1969)
conducted an empirical validation (number of subjects only 28) of Owen's (1970) model along with a
simulated study as part of a dissertation. In the simulation portion of the study, he compared his Bayesian
results with a 60-item simulated two-stage test and a 60-item conventional test. The empirical data showed the
Bayesian ability estimates to converge around 20 items, remarkably similar to Novick's (1969) theoretical
prediction with a different Bayesian testing model. In the simulation portion of the study, Wood found both
Bayesian and two-stage testing to be superior to conventional testing, with the two-stage performing better
than the Owen model in terms of measurement preciseness, although the Bayesian method was more cost
effective. A saving of 2/3 of the number of items required for the conventional and two-stage tests was
evidenced in the results of the Bayesian strategy. This result also supported Owen's theoretical savings.

Ferguson's dissertation (1969) and a later paper (1971) report a model development and empirical
validation for a computer-assisted, criterion referenced instructional system. The purpose of his research was
to apply the sequential analytic techniques of Wald to the decision of mastery or non-mastery of instructional
objectives within a hierarchically-structured domain of achievement. After each item response a decision was
made to classify the student as having mastered the material, not mastered it, or no decision (present another
item). Testing continued until a decision was reached for all students. The computer then selected the next
objective for each subject based upon previous performance.

Ferguson's results were very favorable to the adaptive approach. Both test-retest reliabilities and 
validities were higher than a conventional paper and pencil mode of presentation and a 50% time savings was 
reported on the computer-based measurement system. 

As Green suggested (1970) and Ferguson's research confirmed, the use of adaptive testing as a strategy
for instructional management rather than as a measurement tool may turn out to be the most effective
application of the adaptive models. The instructional situation is immediately concerned with decisions about
a single subject and the oft-mentioned lack of efficiency of the adaptive strategies near the center of the ability
distribution should not be entirely relevant in this context. Further research into instructional applications of
adaptive testing is warranted.

Summary of the Literature on Adaptive Testing

The following conclusions appear warranted based upon the studies in this review:

1. Item pool distributions of difficulty and discrimination values have a large effect on empirical results
in adaptive testing studies. Well-normed item statistics with appropriate distributions are essential for empirical
studies.

2. Average difficulty scoring methods are superior to final difficulty methods.

3. Within the fixed number of stages dimension, the up-and-down method is superior to the
Robbins-Monro method due to the number of items required in the item pool.

4. At least with the models developed to date, paper-and-pencil adaptive testing is not likely to produce
favorable results. Use of a computer greatly enhances this measurement strategy.

5. Although an efficient method for analyzing a model, "real data" simulation studies should be
followed up by empirical validation. The change of item sequencing, item content and test length in adaptive
testing apparently affects examinee performance. This change, at least in the studies reviewed, was
consistent: the simulated studies were far more favorable to adaptive testing than the empirical validations of
the same model.

6. Theoretical studies need to consider item parameters more closely attuned to the reality of 
measurement. Although assumptions of no guessing, all items having equal difficulty or discrimination indices, 
etc., simplify analysis, the results of this type study are not generalizable to the world of testing. Follow-up 
validations are essential. 

7. Group indices such as reliability and validity may not be appropriate measures of the effectiveness of
adaptive testing. An information function as described by Lord seems preferable.

8. A fully adaptive model in which both the number of items presented and the step size are variable
should produce the greatest gains.

9. A large reduction in the number of items necessary for effective measurement seems probable using
adaptive procedures.

10. Adaptive testing shows promise as an effective, feasible alternative to conventional testing. 

III. THE STRADAPTIVE TESTING MODEL


Lord's theoretical analysis of adaptive testing versus conventional testing made one point very clear:
a peaked test always provided more precise measurement than an adaptive test of the same length when the
testee's ability was at the point at which the conventional test was peaked. As shown in Figure 2, at some point
on the ability continuum, generally beyond about + .5 standard deviations from the mean, the adaptive test
requires fewer items for comparable measurement efficiency.

Lord's conclusion suggests that an "ideal" testing strategy would present a collection of items to each
subject comprising a peaked test with a .50 probability of a correct answer for examinees of the particular
subject's true ability (P = .50). The catch, of course, is that the true ability of the subject is unknown; the
estimation of which is, in fact, the desired outcome of the measurement procedure.

Traditionally, this problem has been circumvented by peaking the test at P = .50 for the hypothetical
average ability level subject. This procedure worked well for examinees near the center of the ability
continuum, but less efficiently near the extremes.

Weiss and colleagues at the University of Minnesota have developed and begun validating a model
designed to combine the best of both of these two competing measurement strategies. They have combined the
underlying philosophy of the Binet-Simon IQ measurement with the work of Lord to produce their so-called
stradaptive testing model (stratified adaptive). The Binet testing procedure began testing at an "entry point"
on the ability continuum judged to be appropriate by the examiner. He presented a short sub-test to the subject
which was peaked around P = .50 for subjects of a comparable "mental age." Based upon the subject's
proportion of correct responses to the first sub-test the examiner selected the next peaked sub-test which had
an average P = .50 for groups of respectively higher or lower mental ages.

The Binet strategy defined two subtest levels for a subject. In the early testing, the examiner searched for
the subject's "basal age," that is, the peaked test in which the examinee answered all items correctly.
Determination of an examinee's basal age assumed that any less difficult peaked tests would also be below the
subject's true ability level, thus providing a lower bound on the true ability estimate. Once the basal age is
found, the Binet examiner selects progressively more difficult subtests until the subject's "ceiling age" is
defined. The ceiling age was determined by the subtest in which the subject incorrectly responded to all items.
Testing beyond this difficulty level would only frustrate the subject, reducing the precision of measurement. It
was assumed that any item more difficult than the subject's ceiling level would similarly have been answered
incorrectly. The items between the basal and ceiling ages provided accurate ability estimation for the subject. If
the subtests had been properly normed, the subject's proportion of correct responses within the subtests he
had taken should decrease monotonically from 1.00 at his basal age to 0.00 at his ceiling age. The best estimate
of his true ability would be a function of the difficulty of that subtest in which his P = .50.

Weiss' stradaptive model extends this Binet rationale to computer-based ability measurement. A large
item pool is used, with the item parameter estimates based on a large sample of subjects from the same
population as the intended examinees. The items are scaled into a set of peaked levels (strata) according to their
difficulties. The subject's first item is selected based upon a previously collected ability estimate or the
subject's own estimation of his ability on the dimension being assessed.

As in the Binet, the subject's basal and ceiling strata are defined, with testing ceasing when the ceiling 
stratum is determined. A subject's score is a function of the difficulty of the items answered correctly. 

The Item Bank

A stratified, assumed unidimensional, item pool is required for a stradaptive test. Items are organized 
into a number of strata peaked at different difficulty levels. 

Weiss (1973) lists four steps in the creation of the item pool for a stradaptive test.

1. Test a large number of subjects on a large number of items which measures an hypothesized 
unidimensional trait. 

2. Compute item difficulty and discrimination indices on all items in the item bank, in either traditional
p-values and item/total score correlations or using latent trait theory parameter estimates derived from normal
ogive item assumptions (Lord & Novick, 1968). The latter alternative is preferable if the assumptions of the
normal ogive model can be accepted since, theoretically, the estimates derived from this model are not
contingent upon the frequency distribution of ability of the total group. That is, the item characteristic
function is the same for any group of examinees on the unidimensional trait of concern. Two assumptions
underlie latent trait theory: 1) the latent variable space is one-dimensional (K = 1) and 2) the metric for the
ability continuum (θ) can be chosen so that the item characteristic curve for each item g = 1, 2, ..., n (the
regression of item score on θ) is the normal ogive

$$P_g(\theta) = P_g(\theta; a_g, b_g) = \Phi[L_g(\theta)] = \int_{-\infty}^{L_g(\theta)} \varphi(t)\,dt = \int_{-L_g(\theta)}^{\infty} \varphi(t)\,dt,$$

where

$$L_g(\theta) = a_g(\theta - b_g)$$

is a linear function of θ involving two item parameters a_g and b_g, and φ(t) is the normal frequency function. See
Lord and Novick (1968), chapter 16, for further discussion of the normal ogive model and latent trait theory.

3. Assign the items in the pool into I independent strata, where each stratum is a peaked test of J items
with no overlap of item difficulties between adjoining strata. The number of strata, I, depends on the size and
distribution of item difficulties, with the precision of measurement approaching equality throughout the
distribution of ability levels as I increases. Figure 5 depicts the item pool stratification plan.

Weiss recommended that a minimum of 10 to 15 items per stratum appeared appropriate and that 
experience with the model suggested more items be placed in the lower and middle difficulty strata than at the 
upper strata.

4. Arrange the items within strata by discrimination index from top to bottom in each stratum. Since
items taken earlier in a stratum should reflect a wider range of abilities, finer discrimination is not required.
Items lower in a stratum should be reached when testing is confined to only a narrow range of abilities and
"fine" discrimination between ability estimates is necessary.
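
The normal ogive of step 2 and the stratification of step 3 can be sketched in a few lines. This is a minimal illustration under assumptions that are not part of Weiss' description: items are represented as hypothetical (item_id, a, b) tuples, and the pool is cut into strata of nearly equal size after sorting by difficulty, which is only one of several ways to obtain non-overlapping strata. Within-stratum ordering (step 4) is left out.

```python
import math

def normal_ogive(theta, a, b):
    """Probability of a correct response, Phi(a * (theta - b)).

    Uses the identity Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    """
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

def stratify(items, n_strata):
    """Split items into n_strata contiguous difficulty bands.

    items: list of (item_id, a, b) tuples (hypothetical layout).
    Returns a list of strata, easiest stratum first. Equal-count
    bands are an assumption made for this sketch.
    """
    ordered = sorted(items, key=lambda item: item[2])  # sort by b
    size, extra = divmod(len(ordered), n_strata)
    strata, start = [], 0
    for i in range(n_strata):
        stop = start + size + (1 if i < extra else 0)
        strata.append(ordered[start:stop])
        start = stop
    return strata
```

For an item with b equal to the examinee's θ, normal_ogive returns .50, which is the peaked-test condition each stratum is meant to approximate for examinees at its own difficulty level.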

Table 3 shows the actual distribution of items used in this experiment. The final pool included 244 items
grouped into 9 strata according to normal ogive item difficulty parameters as shown in Table 3.

Figure 6 shows the relationship between a_g and b_g parameters in the stradaptive pool. As is typical in
educational and psychological research, the more difficult items tend to have the lower
discrimination values. The correlation between b_g and a_g of -.31 reflects this problem. Selection and rescaling
procedures will be described in Chapter four of this paper.

The nine strata in Table 3 are essentially nine peaked tests varying in average difficulty from -2.12 to
+1.91. Stratum 9, the most difficult peaked test, for example, was composed of 19 items ranging from b_g = 1.27
to b_g = 3.68. The order of items within a stratum was random, unlike Weiss' model, in order to permit an
alternate-forms reliability coefficient to be calculated on stradaptive examinees. Personal discussion with Weiss
led to the conclusion that the randomized design utilized in this study would not jeopardize the feasibility of
the stradaptive testing procedure. Theoretically, this design could have added a few items to some examinees'
tests, although ability estimates should have been similar to Weiss' procedure estimates. If a bias were caused by
this change, it would make the results from this study less impressive than they might be otherwise in a
comparison between stradaptive and conventional testing.

Figure 5. Distribution of items, by difficulty level, in a stradaptive test (from Weiss, 1973).
[Horizontal axis: difficulty, running from easy items (p = 1.00) to difficult items (p = .00),
where p = proportion correct.]

Figure 6. Scatterplot of relationship between a_g and b_g.

Item Content and Format

All items in the item pool were in the following form:

EXAMPLE: Calf: Cow:

a. puppy: dog

b. nest: bird

c. house: build

d. shell: turtle

These test items were selected for this study for a number of reasons. First, the SCAT Series II provided a
single-format unidimensional test with extensively-normed item parameter estimates. The item format was
easily stored in the computer item file, being short and standard for all 244 items. SCAT II was well received in
Buros' 7th Mental Measurements Yearbook (1971) with internal consistency reliabilities for the five 50-item
forms ranging from .86 to .88 and validities comparable to other leading measures of verbal aptitude.
Administration was relatively short (20 minutes for the published test) and, finally, ETS consented to provide
the items and item parameter estimates for this research.¹

Computer Program for Model Implementation

A computer program fully described by DeWitt and Weiss (1974) was adapted by James Sutherland of
Florida State University to fit the FSU Control Data Corporation 6500 computer.²

Instructional Sequence 

The DeWitt and Weiss program was written so that it could be used by subjects with no prior
cathode-ray-tube experience and with no help from the examination proctor. The proctor simply typed a
single letter into the CRT to select stradaptive or conventional test, and the instructional sequence began. The
subject was asked to type in his social security number and name and was instructed in the use of the CRT and
in the nature of the research. A sample item was presented and responses to the questions in Figure 7 were
requested.



Everybody is better at some things than
others . . . Compared to other people your
age, how good do you think your vocabulary
is?

                                 Entry Stratum
                                 (not seen
                                 by examinee)

Better than:   1 out of 10            1
               2 out of 10            2
               3 out of 10            3
               4 out of 10            4
               5 out of 10            5
               6 out of 10            6
               7 out of 10            7
               8 out of 10            8
               9 out of 10            9

Type in the number from 1 to 9 that gives
the number of people you would guess you
are better than (in vocabulary).



Figure 7. Entry point question for determining subject ability estimate 
(from: Weiss, 1973). 



After completing this task, the subject typed in the word "start" and the testing sequence began.

Testing Sequence

The response to the question in Figure 7 determined the subject's entry point (ability estimate) in the
stradaptive item matrix. The first item the stradaptive subject received was the first item in the stratum
commensurate with his ability estimate. The subject was then branched to the first item in the next higher or
lower stratum depending upon whether the initial response was correct or incorrect. If the subject entered a
question mark (?), the next item in the same stratum was then presented.



¹The test materials from the SCAT Series II Verbal Ability tests were adapted and used with the permission of
Educational Testing Service. The author of this paper gratefully acknowledges the help of ETS in the pursuit of this
research.

²DeWitt's help in the conversion of his program from the University of Minnesota system to the Florida State
University system is gratefully acknowledged. Under the time constraints in this study, program operation prior to data
collection would not have been possible without DeWitt's advice and efforts in our behalf.

Testing continued until a subject's ceiling stratum was identified. For this study, the ceiling stratum was
defined as the lowest stratum in which 25% or less of the items attempted were answered correctly, with a
constraint that at least 5 items be taken in the ceiling stratum. The 25% figure reflects the probability of getting
an item right by random guessing on a 4-option multiple choice test. Once a subject's ceiling stratum was
defined, the program looped back to the examinee's ability estimate stratum and commenced a second
stradaptive test with item selection continuing down the item matrix from where the first test ended. Since
items were randomly positioned within each stratum, parallel, alternate forms were taken by all subjects who
reached termination criterion on the first test.

A maximum of 60 items per subject per test was established, as pre-study trial testing suggested that 
subjects became saturated beyond this point. 
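
The branching and termination rules just described can be summarized in a short sketch. It is a simplified illustration, not the DeWitt and Weiss program: the function names and data layout are hypothetical, a question mark is counted as an attempted (but not correct) item, and an examinee already in the top or bottom stratum simply stays there, which are assumptions made for this example. The second (alternate-form) test is not modeled.

```python
def find_ceiling(taken, right, min_items=5, chance=0.25):
    """Lowest stratum with >= min_items taken and <= 25% correct."""
    for s in range(len(taken)):
        if taken[s] >= min_items and right[s] / taken[s] <= chance:
            return s
    return None

def run_stradaptive(strata, entry, answer, max_items=60):
    """Administer one stradaptive test.

    strata: list of item lists, easiest stratum first.
    entry: 0-based index of the entry stratum.
    answer: callable(item) -> True (correct), False (incorrect),
        or None (question mark).
    Returns (record, ceiling); record holds (stratum, item, result)
    tuples and ceiling is None if no ceiling stratum was reached.
    """
    next_pos = [0] * len(strata)   # next unused item in each stratum
    taken = [0] * len(strata)
    right = [0] * len(strata)
    record = []
    s = entry
    while len(record) < max_items and next_pos[s] < len(strata[s]):
        item = strata[s][next_pos[s]]
        next_pos[s] += 1
        result = answer(item)
        record.append((s, item, result))
        taken[s] += 1
        right[s] += 1 if result else 0
        ceiling = find_ceiling(taken, right)
        if ceiling is not None:
            return record, ceiling
        if result is True:
            s = min(s + 1, len(strata) - 1)   # branch to harder stratum
        elif result is False:
            s = max(s - 1, 0)                 # branch to easier stratum
        # result is None ("?"): stay in the same stratum
    return record, None
```

Because each stratum keeps its own position counter, a subject who returns to a stratum continues with the next unused item there, as in the item matrix described above.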

Scoring Methods

Weiss (1973) suggested ten possible scoring methods for stradaptive testing. These scoring methods 
equate item difficulties to ability estimates through the scaling to normal ogive parameters, assuming a 
unidimensional continuum underlying the item pool.

Most of Weiss' scoring method suggestions were used in this study unchanged. The item scoring methods
can be classified into three types: item scores, stratum scores, and average difficulty scores.

Highest Item Difficulty Scores. Three scoring methods are based on the "hurdle" concept in ability
measurement: that is, the height (difficulty) of the highest hurdle a subject can jump. Thus, a subject's ability
can be estimated as:

Method 1. The difficulty of the most difficult item answered correctly.

Method 2. The difficulty of the n + 1th item (the next item that would have been presented if testing
continued).

Method 3. The difficulty of the most difficult item answered correctly below the subject's ceiling
stratum.

Stratum scores. Since the stradaptive pool can be considered a series of peaked tests, the average
difficulty of the items within each of the strata is a measure of examinee ability for subjects whose ability lies
within a stratum. This rationale suggests four stratum scoring methods similar to methods 1 through 3. A
subject's ability score can be estimated by:

Method 4. The average difficulty of the highest stratum in which at least one item was answered 
correctly. 

Method 5. The average stratum difficulty of the n + 1th item. 

Method 6. The average item difficulty at the stratum just below the ceiling stratum.

Scoring method 7 (the interpolated stratum difficulty score) weights method 6 by the proportion correct at
the highest non-chance stratum, thus resulting in a continuous range of ability estimates.

Method 7. This scoring method is defined as:

$$A = D_{c-1} + S\,(P_{c-1} - .50)$$

where

$D_{c-1}$ = the average difficulty of the $(c-1)$th stratum, where $c$ is the ceiling stratum,

$P_{c-1}$ = the subject's proportion answered correctly at the $(c-1)$th stratum,

and $S = D_c - D_{c-1}$, if $P_{c-1} > .50$,

or $S = D_{c-1} - D_{c-2}$, if $P_{c-1} < .50$,

with $D$ = average difficulty of the designated stratum.
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This scoring method makes . ^^umption that the subject!^s ability lies at the mean difficulty of » 
peaked test (stratum) if exactly 50% of the items are answered correctly. Ability is estimated proportionally 
between the midpoint of his C- 1 and strata. 

Unlike the other 3 stratum scoring methods, method 7 results in a hypothetical continuous range of
possible scores along the entire continuum of ability.
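The interpolation in method 7 can be sketched in a few lines of Python. This is only an illustration of the formula as reconstructed from the definitions above; the function name, the dictionary of stratum means, and the example values are hypothetical and do not come from the study.

```python
def method_7(stratum_means, ceiling, p_below):
    """Interpolated stratum difficulty score (scoring method 7).

    stratum_means: dict mapping stratum number -> mean item difficulty (D)
    ceiling:       the ceiling stratum c
    p_below:       proportion answered correctly at stratum c - 1
    """
    d = stratum_means
    if p_below > 0.50:
        # Interpolate upward, toward the ceiling stratum.
        s = d[ceiling] - d[ceiling - 1]
    else:
        # Interpolate downward, toward stratum c - 2.
        # At exactly .50 the correction term is zero either way.
        s = d[ceiling - 1] - d[ceiling - 2]
    return d[ceiling - 1] + s * (p_below - 0.50)


# Hypothetical stratum means, for illustration only.
means = {4: -0.5, 5: 0.0, 6: 0.6}
print(method_7(means, ceiling=6, p_below=0.75))  # above the stratum 5 mean
print(method_7(means, ceiling=6, p_below=0.25))  # below the stratum 5 mean
```

With 50% correct at stratum c - 1 the estimate equals that stratum's mean difficulty, which is the assumption stated in the text.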

Average difficulty scores. Three possible scoring methods are analogous to Lord's average difficulty
methods. They estimate a subject's ability to be:

Method 8. The average difficulty of all of the correctly answered items.

Method 9. The average difficulty of all items answered correctly between the basal stratum and the
ceiling stratum.

The scoring of method 9 was redefined in this study from Weiss' original definition. As specified by Weiss,
method 9 was not usable when basal and ceiling strata were adjoining. When this result occurred in the present
study, score 9 was defined as:

    A = D_b + S(P_b)

where D_b = average difficulty of items answered correctly in the basal stratum

and S = D_c - D_{c-1}

Method 10. The average difficulty of items correctly answered in the highest non-chance stratum.
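As a rough illustration of how the simplest of these scores follow from a response record, the sketch below computes methods 1 and 8. The record format (a list of difficulty and correct/incorrect pairs) and the example values are assumptions made for the example, not the study's data layout.

```python
from statistics import mean


def method_1(responses):
    """Method 1: difficulty of the most difficult item answered correctly."""
    return max(b for b, correct in responses if correct)


def method_8(responses):
    """Method 8: average difficulty of all correctly answered items."""
    return mean(b for b, correct in responses if correct)


# Hypothetical record: (item difficulty, answered correctly?)
record = [(-0.4, True), (0.3, True), (0.9, False), (0.5, True), (1.2, False)]
print(method_1(record))  # 0.5
print(method_8(record))  # mean of -0.4, 0.3 and 0.5
```

Method 1 depends on a single item, while method 8 averages over every correct response, which is one reason the two can behave differently in reliability.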

Two other revisions were made by the author to Weiss' scoring suggestions. If no basal stratum was
established (i.e., no stratum emerged with 100% correct responses), it was assumed that the subject's basal
stratum lay immediately below the lowest stratum with a correct response in it. Similarly, if no ceiling stratum
was defined (i.e., the subject scored above 25% correct in all strata utilized), the subject's ceiling stratum was
assumed to be immediately above the highest non-chance stratum.

The author made one other change in the Weiss model. Weiss had reported (1973) a problem wherein
subjects of extremely high ability "topped out" his test and answered a high percentage of the presented items
in stratum 9 correctly. Hence, an amendment to the 5 item/25% termination criterion was needed.

Since the probability of a subject of true ability less than the average difficulty of stratum 9 correctly
answering a stratum 9 item is < .50, the joint probability of such an individual correctly answering 5 items in
stratum 9 in a row is < .05, the alpha level used throughout this research. Therefore, whenever 5 items in a row
were correct in stratum 9, testing was terminated. The subject's basal stratum was not affected by the earlier
termination, but his ceiling stratum became "stratum 10," whose mean difficulty was:

    D_{10} = D_9 + (D_9 - D_8)

where D_i = mean difficulty of all items in stratum i.

This change resulted in ability estimates for examinees in this category theoretically ranging from 2.27 to
3.75 for scoring methods 9 and 10. Such ability estimates would seem to be appropriate for subjects
demonstrating such a strong response pattern.
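The arithmetic behind the five-in-a-row rule and the extrapolated "stratum 10" can be checked with a short sketch. The function names are hypothetical, and the stratum means passed to the second function are made-up values used only to show the extrapolation.

```python
def run_probability_bound(p_item, run_length):
    """Upper bound on the chance of `run_length` consecutive correct answers
    when each answer is correct with probability at most `p_item`."""
    return p_item ** run_length


def extrapolated_stratum_mean(d_8, d_9):
    """Mean difficulty assigned to the hypothetical stratum 10."""
    return d_9 + (d_9 - d_8)


bound = run_probability_bound(0.50, 5)
print(bound, bound < 0.05)                  # 0.03125 True
print(extrapolated_stratum_mean(1.5, 2.0))  # hypothetical means -> 2.5
```

Because .50 to the fifth power is about .031, five consecutive correct answers in stratum 9 fall below the .05 level, as the text argues.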

Termination Rules

As indicated earlier, Weiss had two versions of his stradaptive testing computer program. Version one,
which was used in this study, presented another item in the same stratum when a subject skipped an item.

The author of this study was unaware of the existence of the second branching strategy program prior to
completion of data collection. However, Weiss' program procedure of ignoring skipped items in determining
test termination was questioned. It appeared that valuable information was being lost when the Weiss
procedure was followed.

It was reasonable to expect that a subject would omit an item only when he felt he had no real
knowledge of the correct answer. Thus, investigation of the ten scoring methods with termination based upon
omits counted as wrong answers was judged appropriate.

Weiss had set 5 items in the ceiling stratum as the minimum constraint upon termination. A secondary
goal of the present study was to determine what effect the reduction of this constraint to 4 would have upon
the effectiveness of the 10 scoring methods in the stradaptive tests.

These two questions of the handling of omits and the variation in the constraint on the termination of
testing created the following three methods for comparisons:

Termination Method 1: Omits ignored/constraint = 5 items

Termination Method 2: Omits = wrong/constraint = 5 items

Termination Method 3: Omits = wrong/constraint = 4 items

Data were collected using termination Method 1 and then rescored using Methods 2 and 3 for each of the
10 scoring methods. This was possible since no indication of the termination of the first test was given to the
subject and since items were randomly ordered within strata. Once test termination was reached using
termination Method 2 or 3, the next item taken by the subject in his entry point stratum acted as the start of a
parallel forms test under the termination rule used.
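The rescoring idea can be illustrated with a simplified sketch. It assumes, as a reading of the "5 item/25%" criterion, that a test stops as soon as some stratum holds at least the minimum number of scored items with 25% or fewer correct; the five-in-a-row rule for stratum 9 is left out. The function name, record format, and response sequence are hypothetical.

```python
from collections import defaultdict


def termination_index(responses, omits_wrong, min_items, chance=0.25):
    """Return the 1-based item count at which testing would stop, or None.

    responses: sequence of (stratum, outcome) pairs, where outcome is
               "correct", "wrong", or "omit".
    """
    scored = defaultdict(int)  # scored items per stratum
    right = defaultdict(int)   # correct items per stratum
    for i, (stratum, outcome) in enumerate(responses, start=1):
        if outcome == "omit" and not omits_wrong:
            continue  # Termination Method 1: omits are ignored
        scored[stratum] += 1
        right[stratum] += outcome == "correct"
        if scored[stratum] >= min_items and right[stratum] / scored[stratum] <= chance:
            return i
    return None


# Hypothetical response sequence, for illustration only.
seq = [(5, "correct"), (6, "wrong"), (6, "omit"), (6, "wrong"), (5, "correct"),
       (6, "wrong"), (6, "correct"), (6, "wrong"), (6, "wrong")]
print(termination_index(seq, omits_wrong=False, min_items=5))  # Method 1
print(termination_index(seq, omits_wrong=True, min_items=5))   # Method 2
print(termination_index(seq, omits_wrong=True, min_items=4))   # Method 3
```

On this made-up sequence the three methods stop after 8, 7, and 6 items respectively, which mirrors the ordering of test lengths the study reports.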

Of course, Method 2 required fewer items than Method 1, and Method 3 considerably fewer than Method 2.
The thrust of this investigation, then, was to determine the relative efficiency of the three methods in
comparison with one another and with linear testing after equalizing test length using the Spearman-Brown
prophecy formula.
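For reference, the Spearman-Brown prophecy formula projects a reliability r obtained on n items to a test of a different length. The sketch below uses the standard form of the formula; the function name is illustrative, and the example figures are the linear-test values reported in the Results section (.776 on an average of 48.4 items, stepped up to 50 items).

```python
def spearman_brown(r, n_items, target_items):
    """Project reliability r from a test of n_items to one of target_items."""
    k = target_items / n_items  # lengthening factor
    return k * r / (1 + (k - 1) * r)


# Linear-test KR-20 reported in this study, stepped up to 50 items.
print(round(spearman_brown(0.776, 48.4, 50), 3))  # 0.782
```

Stepping every estimate up to a common 50-item length lets tests of very different lengths be compared on the same footing.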

Stradaptive Test Output

Figure 8 provides an example of a stradaptive test report from this experiment. A next to an item 
indicates a correct response; a ' an incorrect response, and shows that the subject omitted the item. 

The examinee in Figure 8 estimated her ability as "5." Hence, her first item was the first item in the fifth
stratum. She correctly answered this question, but missed her second item, the first item stored in the 6th
stratum. She skipped the next item, and after responding somewhat inconsistently for the first nine items,
"settled down" with a very consistent pattern for items 10 through 19, when she reached stopping-rule criterion
and her first test terminated.

At this point in her stradaptive testing, the testing algorithm selected the 6th item in stratum 5 (her
ability estimate) to commence her second test. (The subject was totally unaware of this occurrence, as no
noticeable time delay occurred between her 19th and 20th items.)

At the conclusion of her 31st item, this subject reached termination criterion for her second test, was
thanked for her help in this research project, and given her score of 15 correct answers out of 31 questions with
a percentage correct of 48.4%.

The scores for this subject are shown for both tests. The interested reader may gain a more thorough
understanding of the scoring methods used in this model by tracing this subject's ability estimate scores
through Table 3.



IV. PROCEDURES

Item Pool Construction

Item pool data received from Educational Testing Service entailed five 50-item verbal analogy tests,
Forms 1A, 1B, 1C, 2A and 2B of the SCAT Series II examinations. These tests had been nationally normed on a
sample of 3133 twelfth grade students in October, 1966. The five tests were not of equal difficulty, as shown
by Table 4, with test 1C considerably more difficult than the other 4 tests.

P-values and biserial correlations were provided by ETS on 249 of the 250 items on the five forms,
excluding item number 150, statistics for which were not available. Upon inspection of these indices, item
number 169 was removed from the pool due to a biserial correlation of only .10, considered too low for an
adaptive test.

Prior to rescaling the item statistics to normal ogive parameters, item difficulties were adjusted by adding
an arbitrary value of +.04 to all norm group P-values. This was done to compensate for maturation of subjects
between the age at norming and the age at the experimental testing. The SCAT Series I Technical Manual
reported a constant 4% increase in verbal test scores across quartiles between the 12th and 13th grade years. In
addition, a restriction of range caused by the selectivity of Florida State University admissions requirements
was anticipated, thus making the items for the experimental subjects easier than their normed item parameter
estimates.

Table 4. Comparison of SCAT Series II
Verbal Forms 1A, 1B, 1C, 2A, & 2B (N = 3,133)

Form    Items      Mean    Std Dev   Std Err   KR-20

1A      1-50       28.7    8.7       3.0       .88
1B      51-100     29.9    8.8       3.0       .88
1C      101-150    24.8    7.5       2.8       .86
2A      151-200    30.5    8.2       3.0       .86
2B      201-250    31.4    8.5       2.9       .88

The difficulty and discrimination indices on the remaining 248 items in the pool were transformed into
normal ogive item parameters using the following formulas:



where

    p_g = the proportion correct for item g
    Z = a normal deviate
    γ = the inverse of the cumulative normal distribution function at p_g (a normal deviate)
    r = r_{g\theta} = biserial correlation of item score and ability

(From McBride & Weiss, 1974)

Appendix B shows the ETS item statistics and transformed normal ogive item parameters. This
transformation assumes a normal distribution of ability within the norming group and a metric chosen with
mean ability equal to 0.0 and a standard deviation equal to 1.0.
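The conversion formulas themselves are not legible in this copy. As a hedged illustration only, the sketch below uses the standard classical-to-normal-ogive relations (discrimination a = r / sqrt(1 - r^2), difficulty b = -z / r, where z is the inverse normal CDF at the proportion correct and r is the biserial correlation). Whether the study used exactly this sign convention is an assumption; the function name and the `p_shift` parameter, which mirrors the +.04 adjustment described earlier, are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_ogive_params(p, r_bis, p_shift=0.04):
    """Approximate normal ogive (a, b) from a P-value and a biserial r.

    Assumes ability is distributed N(0, 1) in the norming group.
    """
    # Apply the P-value adjustment and keep the result strictly inside (0, 1).
    p_adj = min(max(p + p_shift, 1e-6), 1 - 1e-6)
    z = NormalDist().inv_cdf(p_adj)    # normal deviate at the proportion correct
    a = r_bis / sqrt(1 - r_bis ** 2)   # discrimination
    b = -z / r_bis                     # difficulty: easy items get negative b
    return a, b


a, b = normal_ogive_params(0.46, 0.60)
print(a, b)  # a is about 0.75; b is about 0 because the adjusted P-value is .50
```

Under these relations a low biserial correlation, such as the .10 of the removed item 169, gives a discrimination near .10 and stretches the difficulty estimate far from zero.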

After calculation of the b_g and a_g values, four additional items were removed from the stradaptive item
pool. Items 101 and 201 had b_g values < -4.00 and items 48 and 250 had b_g values > 4.00. These extreme
values were likely outside the ability range of the subject sample and thus would reduce measurement
efficiency.

Statistical analysis of the resulting item pool is shown in Table 5. An inspection of Table 5 points out a
major problem in the present study. As suggested in Chapter III of this paper, a restriction of range was
anticipated due to the selectivity of Florida State University admissions. In addition, the mean difficulty index
of -.368 reflects an item pool somewhat too easy (most likely a result of the .04 increase in P-values).

Table 5. Descriptive Statistics of Difficulty (b_g)
and Discrimination (a_g) Normal Ogive Parameters

Normal Ogive Parameter    Mean     Std Dev   Std Err   Kurtosis   Skewness

Discrimination (a_g)      .576     .175      .011      -.19       .37
Difficulty (b_g)          -.368    1.132     .072      .33        .37

The distribution of a_g values was satisfactory, with only a slight skew and a mean a_g of .57, but the
combined effect of a relatively easy item pool coupled with an expected high ability subject pool suggested the
possibility of lowered validity and internal consistency reliability coefficients for the conventional (linear)
test group.

Subject Pool 

Each summer, approximately six weeks prior to the start of the academic school year, Florida State
University conducts a three-day University orientation for incoming freshmen. In late July, 1974, thirteen
hundred students attended the orientation program, 27% of the scheduled first year enrollees.

Each orientation participant received welcoming packages including a letter from the author of this
paper. Appendix C presents a copy of the letter, which requested voluntary participation in a computer-based
research project. The voluntary nature of the request was required by University orientation officials. One
other source of subject recruitment was utilized. The CRT's used in this experiment were located in the FSU
library's listening and viewing center. The library held three library orientation tours each day of the
orientation program to acquaint the new students with the library facilities. When these groups were brought
to the area of the listening and viewing center, the author of this paper made "a pitch" for volunteers for the
project.

Of the 103 subjects who participated in the research, 87 had previously taken the Florida 12th Grade
Verbal Ability test (12V). Like the SCAT-V items used in the item pool, 12V items were verbal analogies,
prepared by ETS for the State of Florida. Item format was identical to SCAT Series II Verbal item format.
Reliability (KR-20) of the 12V was reported as .87 for 50 items with a 20 minute time limit. The 12V, thus,
provided an ideal validity criterion for comparison with linear and stradaptive scores from this experiment. In
addition, 12 of the subjects without 12V scores had taken either the American College Testing Program (ACT)
or College Entrance Examination Board (CEEB) verbal tests which had equivalency tables to the 12V. No
criterion scores were available for two of the stradaptive subjects and one of the linear examinees. Validity
indices were thus computed with N = 53 for the stradaptive group and 46 for the linear.

Table 6 shows the comparison between 12V norming group statistics and the subjects sampled in this
experiment. As can be readily seen in Table 6, the suspected restriction of range was certainly evident.

Table 6. Comparison of Florida 12th Grade Verbal Test Scores
(1973 Statewide Administration vs. Subject Sample)

Test Group                  N        12V Mean   Std Dev

Statewide Norming Group     81000    26.15      8.26
Experiment Participants     99       33.83      5.94

Pr(\mu_{state} = \mu_{exp}) < .001
Pr(\sigma^2_{state} = \sigma^2_{exp}) < .01



Both means and variances of 12V scores are significantly different from those of the population, with the
restricted variance of participants in this study predominantly caused by admissions policies, but also possibly
by a "ceiling effect." Regardless of cause, this restriction would lower validity indices within the relatively
homogeneous group of subjects in this experiment.

Fortunately, the primary comparisons of interest in this study were between the stradaptive and linear
test group participants. Table 7 shows the comparison between these two groups within the experiment.



Table 7. Comparison of Distributions of Linear and
Stradaptive Group Florida 12th Grade Verbal Scores

Group          N     Mean     Std Dev   Std Err   Kurtosis   Skewness

Linear         46    33.56    5.80      .855      .44        .70
Stradaptive    53    34.06    6.12      .841      .36        -.03


As can be seen in Table 7, the random assignment of subjects to linear or stradaptive testing groups did a
good job of equating the groups on the ability continuum as measured by the Florida 12th Grade Verbal test.

Research Design

Prior to data collection, 300 random assignments were made to either linear or stradaptive groups and
the linear group was further randomly broken into five subgroups corresponding to the five linear subtests.

As each subject-volunteer entered the testing area, the proctor assigned him the next test listed on the
randomized testing order schedule. Schematically, the research design is depicted in Figure 9.

A comparison of outcomes O_1 through O_5 would indicate the effectiveness of the randomization process
in equating subtest assignment. Assuming no significant differences between these outcomes, comparisons
between O_6 through O_10 could then be made. Since SCAT-V published results had shown significantly
different difficulty levels between the five forms, it was planned that linear subtest scores would be normalized
within their separate distributions and then pooled into a linear total score distribution for comparison with
the stradaptive results.
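One way to picture this pooling step is sketched below, under the assumption that "normalized" means conversion to z-scores within each subtest (the report does not spell out the procedure, and a rank-based normalization would differ). The function name and the example scores are hypothetical.

```python
from statistics import mean, pstdev


def pool_standardized(subtests):
    """Convert each subtest's scores to z-scores, then pool them.

    subtests: dict mapping a subtest label to a list of raw scores.
    """
    pooled = []
    for scores in subtests.values():
        m, s = mean(scores), pstdev(scores)
        # Each score is expressed relative to its own subtest's distribution.
        pooled.extend((x - m) / s for x in scores)
    return pooled


# Hypothetical raw scores on two subtests of unequal difficulty.
raw = {"linear_1": [30, 38, 41], "linear_3": [22, 27, 29]}
pooled = pool_standardized(raw)
print(len(pooled), round(abs(mean(pooled)), 6))  # 6 scores, pooled mean near 0
```

Standardizing within subtests removes the difference in form difficulty before the scores are combined, so a harder form does not pull its examinees toward the bottom of the pooled distribution.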

The independent variables for the comparisons in this study were linear or stradaptive group,
termination rule, 12V score and scoring method. Dependent variables included test scores, item latency,
number of items, standard errors (and/or reliability), and validity.

Data Collection

A file was created as each subject went through the instructional and testing process. A description of
data collected is listed in Appendix D. Item data stored included response code (correct, incorrect or skipped),
the subject's actual response (A, B, C, D or ?), the number of the item in the total pool (1-250), the number of
presentations of the question, and item response latency in seconds. These data were collected for each of a
maximum of 60 items, with the word "break" inserted in the item data file between the first and second tests
of a stradaptive subject.

These subject data files were stored separately under individual file names for later analysis and
computer-generated reports like Figure 8.

[Figure 9. Research design for linear versus stradaptive group assignment and comparison. The diagram
crosses sampling strategy (five randomized linear subtest groups, the linear total with N = 47, and the
stradaptive total with N = 55) with number of subjects, 12th Grade Verbal outcome, and CRT Verbal outcome.
R = randomization; O_i = measurement outcome i.]



Data Analysis 

The following analyses were planned: 

1. Total linear vs. total stradaptive using 3 termination rules and 10 scoring methods.

(a) Standard errors of measurement

(b) Reliability (parallel forms and KR-20)

(c) Validity (correlation between 12V and test scores); number of items per terminated test

(d) Item latency

2. Correlation between the linear subject's ability estimate and his 12V score and linear test score.

3. Correlation between the linear subject's 12V scores and item latency.

4. Correlation between scores of any subjects who took both linear and stradaptive tests. (This
situation was not part of the original design of this experiment, but a few subjects requested to "do it again"
and were administered the "other" test.) This correlation coefficient would be spuriously high due to common
items between the linear test and approximately 1/5 of the items on the stradaptive test.

Attitudinal Data

Consideration had been given to preparing a questionnaire to survey subject reaction to the
computer-based mode of test presentation used in this study. It was decided to forego a formal attitudinal
study for the following reasons:

1. Considerable evidence already exists pertaining to subject reaction to computer-assisted testing and
instruction (Hansen, 1969). The computer mode of presentation evidently does not decrease subject test
performance.

2. The main thrust of the current research was a validation of the stradaptive model, not of
computer testing.

3. Subjects took only computer-based testing and therefore probably had no realistic basis of
comparison.

Despite these considerations, the closing screen shown each subject before he left the CRT did request
any comments he might have about computer testing "to aid the researchers in future studies." These
comments were jotted into a ledger for synthesizing in the conclusion section of this paper.

V. RESULTS AND DISCUSSION

Table 8 shows a comparison of the distribution of the five linear subtests and their respective 12V score
distribution.

Table 8. Comparison of Distributions of 5 Linear Subtests

                   12th Grade Score      Subtest Score
Subtest    N       Mean     Std Dev      Mean    Std Dev   Kurtosis   Skewness

1          8       3?.1     7.43         .76     .11       -.69       .32
2          7       3?.6     3.82         .68     .15       -.96       .52
3          9       30.5     3.62         .53     .08       -1.39      .24
4          13      33.4     6.65         .81     .08       -.61       -.56
5          10      32.4     4.67         .76     .10       -.47       -.56


Surprisingly, the mean 12V score of the group taking linear test 1 was significantly higher than the other
four groups (p < .05). In the comparison of the proportion of the items answered correctly (omits counted
wrong) on the five subtests, linear 4 was significantly easier than linear 2 and, as expected, linear 3 was
significantly more difficult than the other four subtests. In addition, linear 3 produced a decidedly platykurtic
distribution, while linear 4 and linear 5 evidenced a concentration of responses at the higher end of the
distribution.

Despite these differences in distribution shape, the five subtests were normalized and then pooled for
group comparison with stradaptive test results. The resulting distribution of total linear scores is shown in
Table 9. The distribution was essentially normal, though platykurtic.



Table 9. Distribution of Pooled Linear Test Scores

Mean    Std Dev   Std Err   Kurtosis   Skewness

-.02    .946      .138      -.67       0.06

Linear Test Reliability

Stanley (1971) described the procedures for estimating the internal consistency reliability (KR-20) for a
test in which different subjects took different items and different numbers of items from a unidimensional
pool.

Making the standard assumptions underlying the one-factor random effects analysis of variance
(ANOVA), an estimated reliability coefficient of the total scores, X_p, of persons receiving I_p items may be
obtained through the use of the following formula:

    \rho_{xx'} = \frac{I_p \rho_{intraclass}}{1 + (I_p - 1)\rho_{intraclass}} = 1 - \frac{MS_{error}}{MS_{persons}}

Table 10 displays the ANOVA source table for the linear group in this experiment. The internal
consistency reliability estimate for the linear test was .776 for a test of an average of 48.4 items in length.
Stepped up to 50 items via the Spearman-Brown prophecy formula, this estimate becomes .782. The
comparable reliability of the original SCAT-V tests was .87. Using Feldt's (1965) test, the two reliabilities
differ significantly (p < .05).



Table 10. Analysis of Variance for Linear Test Person by Item Matrix

Source     df      Sum of Squares   Mean Square

Persons    46      37.57            .817
Error      2229    408.55           .183
Total      2275    446.12

    r_{xx(lin)} = 1 - .183/.817 = .776
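The reliability figure beneath Table 10 can be recomputed from the sums of squares and degrees of freedom in the table. The sketch below assumes the usual analysis-of-variance form, one minus the ratio of error mean square to persons mean square; the function names are illustrative.

```python
def mean_square(sum_of_squares, df):
    """Mean square for one row of an ANOVA source table."""
    return sum_of_squares / df


def anova_reliability(ms_persons, ms_error):
    """Internal-consistency reliability from a person-by-item ANOVA."""
    return 1 - ms_error / ms_persons


# Values from Table 10 (linear test).
ms_p = mean_square(37.57, 46)
ms_e = mean_square(408.55, 2229)
print(round(ms_p, 3), round(ms_e, 3))           # 0.817 0.183
print(round(anova_reliability(ms_p, ms_e), 3))  # 0.776
```

The same two functions apply to any person-by-item source table of this form.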



It can be assumed that the difference between these reliabilities was caused by one or more of three factors:

1. Testing mode (CRT vs. paper and pencil).

2. Elimination of 6 items from the original item pool.

3. Restriction of range in subject pool for this experiment.

The latter factor most likely caused the majority of the decrease in the reliability of the test scores. The
homogeneity of the subjects would yield a relatively small amount of between-person variance, which, when
coupled with a constant error variance, would lower the reliability estimate. It might also be mentioned that
Stanley noted that intraclass item correlation is a lower bound to the reliability of the average item.

Test theory suggests that measurement efficiency is maximized at p = .50 for a given test group. It was
hypothesized that the stradaptive test strategy would better approach this standard than the conventional
linear test. If supported, this result would indicate an improved selection of items for the stradaptive examinee.
Table 11 shows the result of this comparison. It clearly indicates significantly different distributions of test
difficulty. The stradaptive test was far more difficult than the linear test, with a smaller variance.

Table 11. Comparison of Difficulty Distributions (p)
for Linear and Stradaptive Groups

Group          N     Mean     Std Dev   Std Err   Kurtosis   Skewness

Linear         47    .752*    .123**    .018      -.87       -.39
Stradaptive    55    .584     .084      .011      5.14       -.97

*  Pr(\mu_{str} = \mu_{lin}) < .0001, using t' = (\bar{X}_1 - \bar{X}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}.
   This test makes no assumption about the equality of population variances (from: Winer, 1971).
** Pr(\sigma^2_{str} = \sigma^2_{lin}) < .05, using Cochran's Test for Homogeneity
   of Variance (from: Winer, 1971).


Linear Test Validity

The reported correlations of the SCAT-V Series II scores with several criteria are summarized in Table
12. The correlation of obtained linear scores with the Florida 12th Grade scores was .477, which was lower
than the published SCAT-V:SAT-V correlation (p < .01). As with the linear reliability, this difference
probably resulted from the homogeneous distribution of subjects in this experiment.



Table 12. Reported Correlations of SCAT-V Scores with External Criteria

Criterion                               N      r_12

High School English Grades              244    .46
Normalized Rank in Graduating Class     244    .49
Rank in Graduating Class                518    .52
SAT-V                                   244    .83



Stradaptive Pool Item Stratification

Table 13 summarizes the proportion of items in each stratum that were actually used in the stradaptive
testing.

Table 13. Proportion of Items in Each Stratum
Actually Used in CRT Stradaptive Testing
(N = 55)

Stratum                               1     2     3     4     5     6     7      8      9

Number of Items in Stratum            20    26    33    39    31    28    26     22     19

Proportion of Available Items
Used Within Stratum                   .10   .12   .18   .38   .68   .61   1.00   1.00   1.00



laisest 



39 

35 



0 



The results depicted in Table 13 tend to contradict Weiss' suggestion that a larger proportion of items
should be assigned to lower and middle strata (Weiss, 1973). The present author recommends that the decision
be based upon prior knowledge of the distribution of ability of the subject pool to be tested. Such prior
knowledge includes school admissions requirements and any other information the decision-maker may have
available about the target population ability level.

Stradaptive Total-Test Reliability

Using Stanley's (1971) procedure, it was possible to estimate the internal-consistency reliability of the
person-by-item stradaptive test matrix using scoring method 8. Appendix A, columns 7-9, shows the pattern
of item presentation across subjects. Of the 244 items in the stradaptive pool, only 133 items were actually
presented to the subject pool in this experiment.

Scoring method 8 provided the only set of stradaptive test scores wherein a person's total test score was a
linear function of his item scores. Hence, scoring method 8 was used to estimate internal-consistency reliability
using Stanley's ANOVA procedure. Table 14 summarizes these results.

In addition to the internal-consistency reliability estimate shown in Table 14, parallel-forms correlations
on the total stradaptive pool using the three termination rules with ten scoring methods were calculated. Table
15 displays these results.



Table 14. Analysis of Variance of Scoring Method 8
of Stradaptive Test Person-by-Item Matrix

Termination
Rule          Source     df      Sum of Squares   Mean Squares

1             Persons    54      191.941          3.555
              Error      1675    588.253          .351
              Total      1729
              (r_20 = .901)

2             Persons    54      178.870          3.312
              Error      1401    470.442          .336
              Total      1455
              (r_20 = .899)

3             Persons    54      155.841          2.886
              Error      1001    366.447          .366
              Total      1055
              (r_20 = .873)



Table 15 shows the statistical analysis of the differences between parallel-forms reliability estimates on
the stradaptive test scores. Significance of the differences in reliability coefficients (r_xx) was determined using
Ferguson's (1971) formula.

Table 16 shows the parallel-forms and KR-20 reliability estimates for the three termination rules used in
this study. Direct comparisons can be made between the stradaptive KR-20 values and the .776 linear KR-20
estimate. According to Feldt's (1965) approximation of the distribution of KR-20, all of the estimates of the
stradaptive test reliability are significantly (p < .05) better than the linear KR-20 estimate prior to being
stepped up by the Spearman-Brown formula: Pr(.675 < \rho_{20} < .858) = .95. Thus, the 19, 26, and 31 item
stradaptive tests all proved more reliable than the 48 item linear test. This is the key finding in this study.

A comparison of the linear internal-consistency reliability coefficient (r_20) and the stradaptive
parallel-forms reliability estimate (r_xx) can be considered only tentatively since they are different kinds of
estimates of the true reliability. The sampling distribution of the parallel-forms correlation is known and that
of KR-20 has been approximated by Feldt (1965). Cleary and Linn (1969) compared standard errors of both
indices with generated data of known \rho. They found the standard error of the parallel-test correlation to be
somewhat larger than that of KR-20 (approximately .05 vs. .04 in the range of reliabilities, number of subjects
and number of items involved in this experiment). Should these results generalize to this study, scoring methods
6, 7, 8, and 9 under termination rule 1, and scoring method 8 under termination rule 3 produced higher
reliability than the linear test.

Table 15. Comparison of Parallel-Forms Reliabilities for 10 Stradaptive Test
Scoring Methods under Three Termination Rules Stepped Up to 50 Items

Termination Rule 1 (N = 12; mean number of items = 31.45)
Scoring method   8      6      7      9      4      10     ?      ?      ?      ?
r_xx             .929   .910   .902   .879   .703   .620   .616   .436   #      #

Termination Rule 2 (N = 28; mean number of items = 26.47)
Scoring method   8      9      7      6      ?      10     ?      ?      ?      ?
r_xx             .806   .782   .750   .698   .682   .614   .432   .379   #      #

Termination Rule 3 (N = 38; mean number of items = 19.2)
Scoring method   8      6      ?      ?      10     ?      ?      ?      ?      ?
r_xx             .903   .821   .820   .791   .784   .689   .590   .587   .582   .513

# = negative parallel-forms correlation; differences not calculated.
? = scoring method label not legible in this copy.

Table 16. Comparison of Scoring Method 8 Parallel-Forms Reliability
with KR-20 Reliability Over Three Termination Rules
Stepped Up to 50 Items

                                   Termination Rule
                                   1        2        3

Parallel Forms   N                 12       28       38
                 r (raw)           .892     .688     .732
                 r (50)            .929     .806     .903

KR-20            N                 55       55       55
                 \rho_{20} (raw)   .901     .899     .873
                 \rho_{20} (50)    .935     .943     .947
                 K_i               31.45    26.47    19.2

K_i = average number of items under termination rule i.

The interpretation of the results shown in Table 15 was clear. In the comparison of scoring methods,
methods 6, 7, 8, and 10 were significantly (\alpha = .05) more reliable than methods 1, 2, and 5 within all three
termination rules. Scoring method 8 produced the highest reliability estimate under all three termination rules.
In the comparison between the three termination rules, termination methods 1 and 3 are significantly better
than termination method 2 (p < .05) using the Wilcoxon Matched-Pairs Signed-Ranks Test (Siegel, 1956).

Stradaptive Test Validity

The validity coefficients of the 10 stradaptive scoring methods under the three termination rules are
shown in Table 17. Validity was estimated by the correlation between the test scores and 12V scores.



Table 17. Comparison of Validity Coefficients of 10
Stradaptive Test Scoring Methods Under Three Termination Rules

Termination Rule 1 (N = 64)
Scoring method   8      9      1      5      7      3      10     6      2      4
r_c1             .526   .513   .477   .443   .437   .425   .395   .385   .380   .370

Termination Rule 2 (N = 80)
Scoring method   8      9      7      3      5      1      10     6      2      4
r_c2             .536   .501   .471   .420   .403   .397   .393   .365   .350   .275

Termination Rule 3 (N = 91)
Scoring method   7      5      8      3      9      6      2      10     1      4
r_c3             .509   .500   .499   .492   .476   .467   .455   .442   .410   .240

r_ci = correlation between criterion measure (12V) and scoring method i.

Among the ten scoring method validity coefficients, the following comparisons showed significant
differences (p < .05):

Termination method 2:

    Scoring method 8 > scoring method 4.

Termination method 3:

    Scoring methods 7, 5, and 8 > scoring method 4.

None of the validity coefficients in Table 17 were significantly different from the linear validity coefficient of
.477. Since scoring methods 6, 7, 8, and 10 were consistently more reliable than the other methods, the
validity coefficients for these four methods were raised by the so-called "correction for attenuation" for
comparison purposes. Table 18 shows the results of this adjustment.



Table 18. Correlation of the Four Most Reliable Stradaptive Scoring
Methods with 12V, Corrected for Attenuation

Termination                     Scoring Method
Rule         Statistic       6       7       8      10

1            r_xx         .910    .902    .929    .620
             r_xc         .385    .437    .526    .395
             (r_xc)      (.433)  (.493)  (.585)  (.538)

2            r_xx         .698    .750    .806    .614
             r_xc         .365    .421    .536    .393
             (r_xc)      (.528)  (.544)  (.693)  (.623)

3            r_xx         .82?       ?    .903    .784
             r_xc         .467       ?    .499    .442
             (r_xc)      (.627)      ?   (.626)  (.621)

r_xx = parallel-forms reliability estimate

r_xc = correlation of scoring method test score with 12V

(r_xc) = r_xc corrected for unreliability of 12V and stradaptive scoring method
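
The adjustment in Table 18 can be illustrated with the classical correction for attenuation, r_xc / sqrt(r_xx * r_cc). The sketch below is only an illustration under stated assumptions: the reliability of the 12V criterion (r_cc) is not reported in this section, and the value 0.87 is an assumed figure chosen because it is consistent with the Termination Rule 1 entries. The function and constant names are hypothetical.

```python
import math


def correct_for_attenuation(r_xc: float, r_xx: float, r_cc: float) -> float:
    """Classical disattenuated correlation: r_xc / sqrt(r_xx * r_cc)."""
    if not (0.0 < r_xx <= 1.0 and 0.0 < r_cc <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xc / math.sqrt(r_xx * r_cc)


# ASSUMPTION: reliability of the 12V criterion; not stated in this section.
R_CC_ASSUMED = 0.87

# Termination rule 1 entries from Table 18: scoring method -> (r_xx, r_xc)
RULE_1 = {
    6: (0.910, 0.385),
    7: (0.902, 0.437),
    8: (0.929, 0.526),
    10: (0.620, 0.395),
}

for method, (r_xx, r_xc) in RULE_1.items():
    corrected = correct_for_attenuation(r_xc, r_xx, R_CC_ASSUMED)
    print(f"method {method:>2}: r_xc = {r_xc:.3f}, corrected = {corrected:.3f}")
```

With the assumed criterion reliability, the four corrected values round to the parenthesized Termination Rule 1 entries; no claim is made here about how the other rows were computed.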



When both validity and reliability were considered, stradaptive scoring methods 7 and 8 were judged
superior to the other methods considered in this study.

Method 8, the mean difficulty of all items answered correctly, has several characteristics to recommend
it. It would seem to use the maximum amount of information available from the subject's responses. Since the
subject's total score under method 8 is a linear transformation (a mean) of the item scores, Stanley's (1971)
ANOVA internal-consistency reliability estimating procedure is applicable. For both experimental and applied
situations, a single testing design is more feasible than a test-retest or parallel-forms design.

Method 8 does suffer from two conceptual flaws. Whenever a subject's ability estimate (entry point) was
grossly low, scoring method 8 would be biased toward a lower estimate of the subject's true score. In addition,
the method is inflated by "lucky guessing." If an ability estimate were prestored on subjects, or if it could be
assumed that they could estimate their own ability fairly well, method 8 would be the best method of
implementing stradaptive testing.
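
The two flaws can be made concrete with a small sketch of scoring method 8 as it is described here (the mean difficulty of all items answered correctly). The difficulty values and response records below are invented for illustration and are not data from the study; the function name is hypothetical.

```python
from statistics import mean
from typing import List, Tuple

# One response record: (item difficulty, answered correctly?)
Response = Tuple[float, bool]


def score_method_8(responses: List[Response]) -> float:
    """Mean difficulty of all items answered correctly."""
    correct = [difficulty for difficulty, is_correct in responses if is_correct]
    if not correct:
        raise ValueError("no correct responses to score")
    return mean(correct)


# Hypothetical record for a subject whose ability is near difficulty 0.5.
baseline = [(0.0, True), (0.5, True), (1.0, False), (0.5, True), (1.0, False)]

# Flaw 1: an entry point that is far too low adds easy correct items
# and pulls the mean downward.
low_entry = [(-2.0, True), (-1.5, True)] + baseline

# Flaw 2: one lucky guess on a hard item pushes the mean upward.
lucky_guess = baseline + [(1.5, True)]

print(score_method_8(baseline))     # score with a reasonable entry point
print(score_method_8(low_entry))    # lower, because of the poor entry point
print(score_method_8(lucky_guess))  # higher, because of the lucky guess
```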

In the present study, the correlation between the subjects' ability estimates and their total linear score
was .466, essentially as good a predictor of their linear scores as the Florida 12th Grade Verbal test scores
(.477). Under such a situation, scoring method 8 appears to be conceptually sound as an estimator of a
subject's true score.

In the case where no ability estimate was available for examinees, and it could not be assumed that they
could fairly accurately estimate their own ability (young children, for example), method 7 would be the
recommended scoring method on a stradaptive test.

Linear vs. Stradaptive Comparisons

Given the stradaptive-test scoring recommendations in the previous section, how do linear and
stradaptive testing procedures compare overall? Table 19 shows the results of the three termination rules for
scoring method 8 of the stradaptive test along with linear test statistics.

Table 19. Comparison of Linear Test with
Scoring Method 8 Under Three Termination Rules
of the Stradaptive Test

                                                  Stradaptive Test Termination Method
                                        Linear Test       1        2        3

Total Test Variance                        .817         .403     .433     .433
Standard Error of Measurement              .428         .162     .157     .157
KR-20 Reliability                          .776         .935     .943     .947
Parallel Form Reliability                    *          .929     .806     .903
Validity                                   .477         .526     .536     .509
Validity (Corrected for Attenuation)       .546         .585     .693     .626

*No linear parallel-forms reliability calculated.
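
The standard errors of measurement in Table 19 are consistent with the classical relation SEM = SD * sqrt(1 - reliability), with the KR-20 value taken as the reliability. That the author used exactly this formula is an assumption; the sketch below only checks it against the linear test and termination methods 1 and 2.

```python
import math


def standard_error_of_measurement(variance: float, reliability: float) -> float:
    """Classical SEM: standard deviation times sqrt(1 - reliability)."""
    if variance < 0.0 or not 0.0 <= reliability <= 1.0:
        raise ValueError("variance must be >= 0 and reliability in [0, 1]")
    return math.sqrt(variance) * math.sqrt(1.0 - reliability)


# (total test variance, KR-20 reliability) as listed in Table 19
TABLE_19 = {
    "linear": (0.817, 0.776),
    "stradaptive, termination method 1": (0.403, 0.935),
    "stradaptive, termination method 2": (0.433, 0.943),
}

for label, (variance, kr20) in TABLE_19.items():
    sem = standard_error_of_measurement(variance, kr20)
    print(f"{label}: SEM = {sem:.3f}")
```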



Table 19 provides strong evidence that the measurement efficiency of the average item on the stradaptive
test is as good as or better than that of the conventional test. Nevertheless, unless a reduction in the number
of items required occurs, as well as a reduction in testing time, the theoretical gain in efficiency may not have
real-world value.

Table 20 shows the difference in number of items presented for the linear and the three termination
methods of the stradaptive test. The average number of items presented per subject was
surprisingly constant over the two parallel tests of termination methods 1 and 3. Method 2 did show a
significant (p < .05) drop in the average number of items on the second test, possibly due to the 60-item limit.

Table 20. Comparison of Average Number of Items for Linear Test and
Three Termination Methods of Alternate Form Stradaptive Tests

                              Mean      Std Dev                 Mean      Std Dev
Test           # Subjects    # Items    # Items   # Subjects   # Items    # Items

Linear             47         48.43       .99

Stradaptive              Test 1                            Test 2

  Method 1         55         31.46      18.03        38        30.92      12.54
  Method 2         55         26.94      16.76        41        21.98      13.10
  Method 3         55         19.20      14.06        47        18.19      11.34

It was hypothesized that mean latency would be higher for stradaptive subjects since they would have to
"think" about each item as it was near the limit of their ability. Table 21 reflects the results of this comparison.



Table 21. Comparison of Distributions of Item Latency
Between Linear and Stradaptive Groups

Group           # Items     Mean sec/item     Std Dev

Linear            2276         35.999*        12.062*
Stradaptive       1730         40.047         13.219

*Pr(M_str = M_lin) < .001; Pr([?]_str = [?]_lin) < .001







The hypothesis of no differences between item latencies was rejected. For the subjects in this
experiment, the average stradaptive item required approximately 11% longer than the average linear item.
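
The 11% figure, and its consequence for testing time, can be checked with a few lines of arithmetic. The sketch below uses the mean latencies from Table 21 and the first-test mean item counts from Table 20. It assumes, as a simplification not made in the text, that the single stradaptive mean latency applies under every termination method and that expected testing time is the product of mean items and mean seconds per item.

```python
LINEAR_MEAN_SEC = 35.999  # mean seconds per item, linear group (Table 21)
STRAD_MEAN_SEC = 40.047   # mean seconds per item, stradaptive group (Table 21)


def percent_longer(new: float, base: float) -> float:
    """Percentage by which `new` exceeds `base`."""
    return (new / base - 1.0) * 100.0


def expected_test_seconds(mean_items: float, mean_sec_per_item: float) -> float:
    """Rough expected testing time: mean items times mean latency."""
    return mean_items * mean_sec_per_item


# Mean number of items on the first test (Table 20)
MEAN_ITEMS = {"method 1": 31.46, "method 2": 26.94, "method 3": 19.20}

print(f"latency increase: {percent_longer(STRAD_MEAN_SEC, LINEAR_MEAN_SEC):.1f}%")

linear_time = expected_test_seconds(48.43, LINEAR_MEAN_SEC)
for label, items in MEAN_ITEMS.items():
    strad_time = expected_test_seconds(items, STRAD_MEAN_SEC)
    print(
        f"{label}: items ratio = {items / 48.43:.2f}, "
        f"time ratio = {strad_time / linear_time:.2f}"
    )
```

Under these assumptions the time ratio is always somewhat larger than the item ratio, which is the reason time is a more conservative measure of savings than item count.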

Omitting Tendency

The relationship between the tendency to omit and ability was investigated. If the
hypothesis of no differences in the tendency across ability levels was rejected, the handling of the omits could
create a bias in total test scores. For the subjects in this experiment, the correlation between omitting and 12V
score was -.0[?], Pr(r_omit,12V = 0) > .05; thus the hypothesis of no differences was not rejected.

Correlation Between Scores of Subjects Who Took Both Stradaptive and Linear Tests

Six subjects asked to return the next day and take "the other" test. This second testing was permitted,
with the resulting test score data withheld from analysis except for this section of the paper. The correlation
between the scores of subjects on both testing strategies provides an indication of the unidimensionality of the
underlying psychological trait common between the two tests. It must be kept in mind, however, that the
stradaptive item pool was made up of items from the five linear subtests. Thus a dependency between test
methods existed. It would be expected that approximately 1/5th of the items taken on the stradaptive test also
appeared on a subject's linear test. The standardized linear scores and stradaptive score 8 counterparts are
shown in Table 22. Correlation between the measures was .93.
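
The reported correlation can be recomputed from the six score pairs listed in Table 22. The sketch below uses the ordinary Pearson product-moment formula; that the author used this coefficient is an assumption, although the result agrees with the reported value to two decimals.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Standardized scores of the six subjects who took both tests (Table 22)
linear_scores = [0.82, -0.06, -0.14, 0.68, 0.83, -0.25]
stradaptive_scores = [0.67, 0.30, -0.23, 0.81, 0.76, -0.16]

print(f"r = {pearson_r(linear_scores, stradaptive_scores):.2f}")
```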



Table 22. Linear and Stradaptive Scores of Subjects
Who Took Both Tests

Subject     Linear     Stradaptive

   1          .82          .67
   2         -.06          .30
   3         -.14         -.23
   4          .68          .81
   5          .83          .76
   6         -.25         -.16

Attitudinal Information

The overwhelming proportion of comments received after the testing was favorable to computer-based 
testing. Only one subject reported prior experience with CRT operation, yet no major problems arose in any 
students operating the equipment. 

Stradaptive subjects tended to comment that the test was "very hard," and some expressed anxiety at
only getting about half of the items right. This problem suggests that perhaps adaptive testing subjects should
be led to anticipate "only" getting 50% of the items correct in order to keep student motivation up.

The general reaction of the linear subjects was that the test was "a snap," which was consistent with the
over 75% correct response rate shown by the linear group.

Testing Costs

No full cost analysis was planned for this study. However, computer costs were available for the
three-day data collection. A total of $89[?] was spent over the entire period. This total included core memory
(CM), central processor (CP), permanent file storage (MS), data transmittal between the CRTs and the
computer, line printing (LP), and punch card output for 109 subjects. The author had data files punched out as
they were created to assure that data would not be lost in case of a hardware malfunction.

The cost of testing each individual came to less than 2 cents per subject for CM, CP, MS, and LP time on
the CDC 6500 computer. Excluding software preparation costs, hardware rental, etc., this is the expected
computer cost per subject in a large-scale testing program, once set up and operating. The salary of proctors has
not been included in this analysis, although this cost would certainly be small when pro-rated over a large
number of subjects.

In the present study, 6 CRTs were kept on and tied to the computer continuously for 14 hours a day for
3 days in order to be ready for subject-volunteers whenever they arrived. In any implementation of
computer testing outside the experimental situation, exam time would be scheduled, thus minimizing
telephone line transmittal costs.

This cost approximation could be compared with testing costs from the reader's experience. Without
trying to define conventional test cost per se, there is still little doubt that computer-based testing costs less
than conventional testing with the paper and pencil mode for any large-scale testing program.



V. CONCLUSIONS AND IMPLICATIONS FOR FUTURE RESEARCH 

The results of this study favor the further investigation of the stradaptive testing model. The model
produced validity coefficients comparable to conventional testing with a reduction of the number of items
from 48 to 31, 25 and 19 for the three stradaptive termination rules investigated in the study. The internal
consistency reliability for the best stradaptive scoring method was significantly higher than the conventional
KR-20 estimate, and the stradaptive parallel-forms reliability estimates were consistently higher than the
conventional KR-20 for the best of the scoring methods.

The author was not aware of any prior research showing a comparison of item latency data between
adaptive and conventional testing modes. Results in this study clearly indicate that subjects take significantly
longer to answer items adapted to their ability level, about 11% longer in the present study. This is an
important result, as it indicates that future research into adaptive testing of any kind should take this variable
into consideration when evaluating an adaptive test strategy. The net gain of the adaptive model is really a
function of the testing time needed to adequately measure a subject's ability, not the number of items
presented to the subject. All prior research reviewed tacitly assumed that item latency was consistent across
testing strategies. This study indicated this assumption is false.

The statistical power of the tests for significant difference between the experimental and control groups
in this study was too low. Nearly every researcher is forced to "settle" on a smaller "n" than desired due to the
external constraints imposed on his research. This was certainly true in the present study. It is the author's
intent to make this study the first step in an on-going investigation of the stradaptive model, much as Weiss is
doing at the University of Minnesota. Where significant differences did not emerge, as in the validity
coefficients, the trend was consistently favorable to the best stradaptive models in comparison to the linear
models. Should this trend be upheld as the number of subjects in the research grows, stronger statements about
the comparative validities of the two methods could be made. This possibility alone suggests that model
investigation be continued.



Within the three termination rules investigated, KR-20 reliabilities were essentially equal for a test length
of 50 items. Termination method 3, however, would have yielded an equivalent reliability estimate at 25 items
to the "raw" KR-20 estimates of the other two methods at 26 and 31 items. This result supports Novick's (1969)
and Wood's (1971) evidence that the efficiency of adaptive testing "levels off." Their results on Bayesian models
suggested that from 15 to 20 items was optimal, as opposed to the 25-item "peak" shown in the present study.
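
This section does not say how reliabilities were projected to a common test length. One standard classical tool for that kind of projection is the Spearman-Brown prophecy formula, sketched below purely as an illustration. Whether the author used it, and whether its assumptions hold for an adaptive test in which items are not interchangeable, are both open questions; the function name is hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by `length_factor`."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if length_factor <= 0.0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Example: project the termination method 3 KR-20 (.947 at a mean of
# 19.20 items, from Tables 19 and 20) to a 25-item test.
projected = spearman_brown(0.947, 25.0 / 19.20)
print(f"projected reliability at 25 items: {projected:.3f}")
```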

The validity comparisons between the three termination strategies did not yield significant differences.
The trend, however, consistently showed method 2, wherein omits were counted as wrong and the minimum
number of items in the ceiling stratum was 5, as producing poorer measurement than the other two termination
methods. This result is difficult to explain. Method 1 ignored omitted items and set the minimum number of
items in the ceiling stratum to 5. Method 2 considered omits wrong, but used the same test termination rule for
the ceiling strata. Theoretically, the consistent difference between these two methods should reflect that the
first treatment of omits was better. Method 3, which used an identical treatment of omits to method 2, but set
the stopping minimum at 4 items in the ceiling stratum, was also better than method 2. This second result
suggests that presenting fewer items yields higher reliability when omits are counted as incorrect answers.
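
The three termination methods differ only in two settings, which can be written down as a small configuration sketch. The class, field names, and the 'C'/'W'/'O' response codes below are hypothetical conveniences, and the function shows only how the two treatments of omits change a proportion-correct tally; the full ceiling-stratum termination logic is not described in this section and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TerminationRule:
    name: str
    omits_count_as_wrong: bool         # False: omitted items are ignored
    min_items_in_ceiling_stratum: int  # stopping minimum in the ceiling stratum


# Settings as described in the text for the three termination methods.
METHOD_1 = TerminationRule("method 1", omits_count_as_wrong=False, min_items_in_ceiling_stratum=5)
METHOD_2 = TerminationRule("method 2", omits_count_as_wrong=True, min_items_in_ceiling_stratum=5)
METHOD_3 = TerminationRule("method 3", omits_count_as_wrong=True, min_items_in_ceiling_stratum=4)


def proportion_correct(responses: Sequence[str], rule: TerminationRule) -> float:
    """Proportion correct in a stratum; 'C' correct, 'W' wrong, 'O' omitted."""
    scored = [r for r in responses if r != "O" or rule.omits_count_as_wrong]
    if not scored:
        raise ValueError("no scorable responses")
    return sum(r == "C" for r in scored) / len(scored)


stratum_responses = ["C", "O", "W", "C"]  # hypothetical responses in one stratum
for rule in (METHOD_1, METHOD_2, METHOD_3):
    print(rule.name, round(proportion_correct(stratum_responses, rule), 3))
```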

The analysis of the termination rule is further complicated by the existence of Weiss' other branching
model. In the present author's judgment, the strategy of branching to a lower stratum after an omitted item is
conceptually superior to the repetition of another item within the same stratum. Weiss' preliminary results
(personal communication) support this hypothesis, as he has consistently found the test-retest reliability of the
first model to be about .10 higher than the model used in this present study. Given the model evaluated in this
experiment, the author would recommend that termination method 3 be used in future stradaptive testing
since its measurement effectiveness is comparable to the other 2 methods, but with fewer items.

In the comparisons of scoring methods, the mean difficulty of all items answered correctly (method 8) is
recommended for any subject pool that could be assumed to estimate its own ability adequately.
Scoring methods 6, 7, and 10 yielded parallel-forms correlations that were statistically equivalent to method 8,
but methods 6 and 10 consistently produced lower validity. These results are understandable for method 6, the
mean difficulty of all items in the highest non-chance stratum. The author would expect this estimate to be
fairly accurate, but unfortunately the number of possible scores using methods 4, 5, and 6 is limited to the
number of strata in the item pool. Method 7, the interpolated stratum difficulty, corrects for this deficiency in
method 6. Method 10, the average difficulty of the correct responses in the highest non-chance stratum, is
conceptually appealing to the present author. The ability estimate from scoring method 10 is not affected by a
poor entry point ability estimate by the subject or by "lucky" guesses about a subject's ability stratum.

It is recommended, therefore, that future stradaptive experimental studies concentrate upon scoring
methods 7, 8, and 10. These studies should also consider both stradaptive branching models with a comparison
of results from variation in the minimum number of items in the ceiling stratum. A comparison between these
variable-number-of-stage strategies and several fixed-number-of-stage strategies is desirable. The author plans
such an analysis on the present data in the near future. As suggested in prior research, adaptive testing may
reach "peak" efficiency at between 15 and 20 items. A comparison of stradaptive test statistics, for example
with k = 10, 15, 20, and 25 items, with linear testing should investigate this hypothesis. Once the stradaptive
data are collected under the variable strategy, the fixed item statistics can be determined by grading the
stradaptive test after "k" items and then "starting" the subject's second test at the first item of the entry point
level.

One further suggestion for future stradaptive studies has occurred to the author. Following the same
logic which led to termination of a subject's testing when 5 items in a row in the highest stratum had been
correctly answered, the missing of 5 items in a row in any stratum should provide an immediate ceiling stratum
definition. The probability of the occurrence would be less than .05 for a properly normed item pool. In the
case of the present study, 13 of the 55 stradaptive subjects would have terminated a stradaptive test an
average of 12.1 items earlier than termination method 1, with no effect upon the other 42 subjects. The
resulting stradaptive test statistics obtained from the implementation of this suggestion have not been
calculated, except that the change would have reduced the average number of items presented under
termination method 1 to 28.4 from 31.45. The author plans this test statistic analysis for the near future.
However, the suggestion was listed here for the consideration of any other stradaptive investigators.
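
The "less than .05" figure for five consecutive misses can be checked under a simple assumption that is not spelled out in the text: that a subject working near his or her ability level misses each item independently with probability about .5. The function names below are hypothetical.

```python
def prob_run_of_misses(p_miss: float, run_length: int) -> float:
    """Probability that `run_length` consecutive items are all missed,
    assuming independent items with the same miss probability."""
    if not 0.0 <= p_miss <= 1.0 or run_length < 1:
        raise ValueError("need 0 <= p_miss <= 1 and run_length >= 1")
    return p_miss ** run_length


def shortest_run_below(p_miss: float, alpha: float) -> int:
    """Smallest run length whose probability falls below `alpha`."""
    if not 0.0 < p_miss < 1.0 or not 0.0 < alpha < 1.0:
        raise ValueError("need 0 < p_miss < 1 and 0 < alpha < 1")
    run_length = 1
    while p_miss ** run_length >= alpha:
        run_length += 1
    return run_length


print(prob_run_of_misses(0.5, 4))     # four misses in a row: still above .05
print(prob_run_of_misses(0.5, 5))     # five misses in a row: below .05
print(shortest_run_below(0.5, 0.05))  # shortest run with probability below .05
```

Under this assumption a run of five is the shortest run that falls below the .05 level, which matches the five-item rule proposed above.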



Aside from the stradaptive model per se, further research into adaptive testing in which both the number
of stages and step-size are variable is recommended. The Bayesian strategies and Urry's model are examples of
this category of adaptive measurement, and further model development seems appropriate.



Research is necessary with comparisons between stradaptive models rather than the traditional design of
comparing an adaptive method with the conventional method of testing. Weiss' on-going research project is
beginning this type of work, but more is needed. The traditional comparison assumes that conventional test
statistics are the criterion that an experimental testing procedure should try to duplicate. Lord, Green, Weiss et
al., have argued that improved measurement of the individual at all ability levels may be hidden by the use of
classical test statistics such as validity and even reliability. Levine and Lord (1959) suggested an index of
discrimination which considered various levels of the test score range, and Lord's (1972) information function
theory and item characteristic curve theory are an attempt to solve this problem. More theoretical research in
this area is needed.

The goal of this study included the attempt to estimate the degree to which the violation of the
assumptions of the one-factor ANOVA model affected KR-20 reliability estimates. The assumption that items
are independent of one another clearly is violated in any adaptive testing procedure. The degree of effect this
assumption violation causes is unknown, yet most prior research in adaptive testing which has considered
reliability at all has only considered ANOVA KR-20 estimates.

Certainly the results from this study do not allow any definitive statements about this question.
Nevertheless, the three KR-20 estimates were consistently higher than the 3 parallel-forms reliabilities. Cleary
and Linn's (196[?]) Monte Carlo study indicated that r20 provided better parameter estimation than
parallel-forms reliability estimates, so one must question whether the higher ρ estimates are not the result of
the dependency between items. Perhaps the only way this question can be answered is through a Monte Carlo
study of adaptive testing with ρ known and the two methods compared for estimating ρ.

Green (1970) concluded that the computer has only begun to enter the testing business, and that as
experience with computer-controlled testing grows, important changes in the technology of testing will occur.
He predicted that "most of these changes lie in the future in the inevitable computer conquest of
testing."¹

The stradaptive testing model would appear to be one such important change.



¹Green, B.F., Jr. In Holtzman (Ed.), 1970, p. 194.

APPENDIX A: ITEM STATISTIC COMPARISON
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ITEM 
NUM. 



01 
02 
03 
Ok 

05 
06 

07 
08 
09 
10 
11 
12 
13 

15 
16 

17 
18 

19 
20 
21 
22 

.23 
2U 

25 
26 

27 
28 

29 
30 
31 
32 

33 
34 

35 
36 
37 
38 

39 
i^O 
in 
i^Z 
43 

45 
46 



NORM- GROUP 



N 



3133 

3133 

3133 

3133 

3133' 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 
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3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 
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3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 



.86 
.86 
.92 
.93 
.92 
.81 
.79 
.70 
.85 
.83 
.79 
.71 
.73 
.77 
.75 
.68 
.56 
.72 
.58 
.64 
.58 
.58 
,60 
.58 

.63 
.70 
.58 
.,60 
.68 
.48 
,62 
.52 
.53 
.51 
,46 
.38 
.55 
,42 
,40 
.52 

;35 

.53 
.45 
.38 
,40 

.35 



LiMEftR GROUP 



STRDPTV GROUP 
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0- 


*«* 


*** 




10 


'l,00 


0 


*** 


*** 


««« 


10 


1,00 


0 


*** 


*** 


««* 


10 


1.00 


0 


*** 


««« 


«** 


10 


.90 


.10 


1 


1.00 


0 


10 


1,00 


0 


*** 




*** 


10 


1.00 


0 


*** 


*** 


*** 


10 


.90 


,10 


*** 


*** 




10 


1,00 


0 


««« 


*** 


*«« 


10 


.90 


.10 


««* 


*** 


**« 


10 


1,00 


0 


15 


.87 


.10 


10 


.90 


.10 


*** 


*** 


««« 


10 


,80 


.13 


*** 


*** 


««« 


10 


.70 


.15 


*** 


*** 


«««. 


10 


.90 


.10 


*** 


*** 


««* 


10 


.60 


,16 


*** 


*** 


««« 


10 


.70 


.15 


40 


• 78 


.07 


10 


.90 


. .10 


*** 


*** 


««« 


10 ■ 


.60 


" .16 


1 


1.00 


0 


10 


1.00 


0 


3 


.33 


.33 


10 


.70 


.15 


1 


1.00 


0 


10 


.90 . 


.10 


2 


0 


0 


10 


1.00 


0 


1 


0 


0 


10 


.80 


.13 


««« 


««« 


»»» 


10 


.90 


• .10 


1 


■1.00 


0 


10 


.80 


.13 


««« 


««« 


««« 


10 


,80 


.13 


*«« 


««« 




10 


1.00 


0 


3 


1.00 


0 


10 


1.00 


0 


««« 


«*« 


««« 


10 


,40 


,16 


4 


1.00 


0 


10 


.90 


,10 


««« 


««« 


««« 


10 


.70 


.15 


39 




.08 


10 


.60 


.16 


««« 


#*« 


««« 


10 


.70 


.15 


««« 


**« 


««« 


10 


.50 


.17 


54 


.56 


.07 


10 


,40 


.16 


43 


.49 


.08, 


10 


.70 


.15 


««« 


*** 


««« 


10 


.70 


.15 


37 


.38 


.08 


10 


.30 


.15 


2 


0 


0 


10 


.oO 


.13 
.16 






no 


10 


,40 


34 


.41 


.09 


10 


n60 


,16 


««* 


««« 


««« 


10 


.30 


.15 


14 


.57 


.14 


16 


.60 


>,.16 


40 


.50 


.OR 


10 


.70 


.15 


54 


.52 


.07 


10 


.30 


.15 


21 


.33 








ITEM   NORM GROUP        LINEAR GROUP         STRDPTV GROUP
NUM.       N      P      N      P   S.E.      N      P   S.E.

  47    3133    .49     10    .70    .15      2    .50    .50
  4?    3133    .34     10    .40    .16     32      ?    .09
  50    3133    .28     --     --     --     19    .26    .10
  51    3133    .90      8   1.00      0    ***    ***    ***
  52    3133    .86      8    .88    .13    ***    ***    ***
  53    3133    .88      8   1.00      0    ***    ***    ***
  54    3133    .77      8    .75    .16    ***    ***    ***
  55    3133    .87      8   1.00      0    ***    ***    ***
  56    3133    .84      8    .75    .16    ***    ***    ***
  57    3133    .88      8   1.00      0    ***    ***    ***
  58    3133    .86      8   1.00      0    ***    ***    ***
  59    3133    .69      8   1.00      0     40    .93    .04
  60    3133    .64      8    .88    .13      3    .67    .33
  61    3133    .76      8    .75    .16    ***    ***    ***
  62    3133    .71      8    .88    .13    ***    ***    ***
  63    3133    .69      8   1.00      0    ***    ***    ***
  64    3133    .70      8    .38    .18    ***    ***    ***
  65    3133    .71      8    .75    .16     16    .63    .13
  66    3133    .83      8    .75    .16    ***    ***    ***
  67    3133    .71      8    .63    .18    ***    ***    ***
  68    3133    .75      8    .75    .16    ***    ***    ***
  69    3133    .63      8    .88    .13    ***    ***    ***
  70    3133    .74      8    .75    .16    ***    ***    ***
  71    3133    .64      8   1.00      0     14    .57    .14
  72    3133    .62      8    .88    .13    ***    ***    ***
  73    3133    .57      8   1.00      0     33    .70    .08
  74    3133    .75      8    .88    .13    ***    ***    ***
  75    3133    .36      8    .38    .18     21    .52    .11
  76    3133    .55      8    .25    .16    ***    ***    ***
  77    3133    .?3      8    .63    .18     11    .55    .16
  78    3133    .48      8   1.00      0     47    .64    .07
  79    3133    .64      8    .63    .18     10    .60    .16
  80    3133    .47      8    .88    .13     51    .69    .07
  81    3133    .51      8    .50    .19     38    .66    .0?
  82    3133    .52      8    .63    .18      7    .14    .14
  83    3133    .54      8    .63    .18     21    .67    .11
  84    3133    .47      8    .25    .16     47    .53    .07
  85    3133    .57      8    .25    .16    ***    ***    ***
  86    3133      ?      8    .38    .18      1   1.00      0
  87    3133    .38      8    .63    .18     53    .49    .07
  88    3133    .46      8    .50    .19      2      0      0
  89    3133    .52      8    .75    .16    ***    ***    ***
  90    3133    .50      8    .63    .18     55    .67    .06
  91    3133    .39      8    .63    .18     43    .56    .08
  92    3133    .45      8    .38    .18     54    .43    .07






ITEM   NORM GROUP        LINEAR GROUP         STRDPTV GROUP
NUM.       N      P      N      P   S.E.      N      P   S.E.

  93    3133    .3?      8    .25    .16     37    .30    .08
  94    3133    .36      8    .63    .18     28    .46    .10
  95    3133    .41      8      ?    .19     52    .67    .07
  96    3133    .26      8    .25    .16     52    .39    .07
  97    3133    .24      8    .25    .16     32    .34    .09
  98    3133    .35      8    .75    .16     20    .40    .11
  99    3133    .33      8    .75    .16     16    .31    .12
 100    3133      ?      8    .38    .18      ?      ?    .07
 101    3133      ?     --     --     --      1      0      0
 102    3133      ?      7   1.00      0    ***    ***    ***
 103    3133    .67      7    .88    .14    ***    ***    ***
 104    3133    .8?      7    .71    .18      1      ?      0
 105    3133    .86      7   1.00      0    ***    ***    ***
 106    3133    .89      7    .57    .20    ***    ***    ***
 107       ?    .86      7   1.00      0    ***    ***    ***
 108    3133    .81      7    .86    .14    ***    ***    ***
 109    3133      ?      7    .57    .20    ***    ***    ***
 110    3133      ?      7      ?    .20     27    .67    .09
 111    3133      ?      7    .88    .14    ***    ***    ***
 112    3133    .52      7   1.00      0    ***    ***    ***
 113    3133      ?      7    .57    .20      ?    .61    .07
 114    3133    .72      7    .57    .20    ***    ***    ***
 115    3133    .77      7    .86    .14    ***    ***    ***
 116    3133    .67      7    .71    .18    ***    ***    ***
 117    3133    .69      7    .86    .14     12    .50      ?
 118    3133    .66      7    .29    .18    ***    ***    ***
 119    3133    .68      7    .57    .20      8    .63    .18
 120    3133    .62      7   1.00      0    ***    ***    ***
 121    3133    .66      7    .43    .20    ***    ***    ***
 122    3133    .61      7    .29    .18    ***    ***    ***
 123    3133    .61      7    .57    .20      2      0      0
 124    3133    .60      7    .43    .20    ***    ***    ***
 125    3133      ?      7    .43    .20    ***    ***    ***
 126    3133      ?      7    .43    .20      ?    .26    .?7
 127    3133      ?      7    .29    .18     26      ?    .01
 128    3133    .57      7    .29    .18    ***    ***    ***
 129    3133    .?3      7    .14    .14     ?9    .53    .07
 130    3133      ?      7    .14    .14     54    .50    .07
 131    3133    .56      7    .29    .18      3      0      0
 132    3133    .42      7    .29    .18     48    .38    .07
 133    3133    .50      7    .57    .20     18    .33    .11
 134    3133    .?3      7    .29    .18     49    .14    .05
 135    3133    .48      7    .43      ?     53    .55    .07
 136    3133    .51      7    .86    .14     18    .78    .10
 137    3133    .57      7    .86    .14      9    .56    .18
 138    3133    .56      7    .71    .18     38    .55    .08






ITEM   NORM GROUP        LINEAR GROUP         STRDPTV GROUP
NUM.       N      P      N      P   S.E.      N      P   S.E.

 139    3133      ?      7    .57    .20     25    .32    .10
 140    3133      ?      7      ?    .14      1      0      0
 141    3133      ?      7    .86    .14     22    .68    .10
 142    3133      ?      7    .29    .18     50    .36    .07
 143    3133      ?      7      0      0     40      ?    .07
 144    3133      ?      7    .71    .18     15    .67    .13
 145    3133      ?      7    .57    .20     17    .41    .12
 146    3133      ?      7    .29    .18     24    .54    .10
 147    3133      ?      7    .86    .14     37    .78    .07
 148    3133      ?      7      0      0     51    .14    .05
 149    3133      ?      7      ?    .14     43    .40    .08
 150    3133      ?     --     --     --     --     --     --
 151    3133      ?     13    .92    .08      ?      ?      ?
 152    3133      ?     13    .85    .10    ***    ***    ***
 153    3133      ?     13    .92    .08      ?      ?      ?
 154    3133      ?     13   1.00      0    ***    ***    ***
 155    3133      ?     13    .92    .08      5    .80    .20
 156    3133      ?     13   1.00      0    ***    ***    ***
 157    3133      ?     13   1.00      0    ***    ***    ***
 158    3133      ?     13    .85    .10    ***    ***    ***
 159    3133      ?     13   1.00      0      ?      ?      ?
 160    3133      ?     13    .92    .08      ?      ?      ?
 161    3133      ?     13    .85    .10    ***    ***    ***
 162    3133      ?     13   1.00      0      2   1.00      0
 163       ?      ?      ?    .77    .12    ***    ***    ***
 164    3133      ?     13   1.00      0    ***    ***    ***
 165    3133      ?     13    .77    .12      1   1.00      0
 166    3133      ?     13    .92    .08    ***    ***    ***
 167    3133      ?     13    .69    .13    ***    ***    ***
 168    3133      ?     13    .85    .10      9    .78    .15
 169    3133      ?      ?      ?      ?      ?      ?      ?
 170    3133      ?     13   1.00      0    ***    ***    ***
 171    3133      ?     13    .92    .08      ?      ?      ?
 172    3133      ?     13    .92    .08      ?      ?      ?
 173    3133      ?     13    .69    .13    ***    ***    ***
 174    3133      ?     13      ?    .14    ***    ***    ***
 175    3133      ?     13    .92    .08      1   1.00      0
 176    3133    .65     13    .77    .12      2   1.00      0
 177    3133      ?     13    .92    .08    ***    ***    ***
 178    3133      ?     13    .69    .13      3   1.00      0
 179    3133    .60     13   1.00      0    ***    ***    ***
 180    3133    .52     13    .39    .14     27    .67    .09
 181    3133      ?     13    .92    .08      1      0      0
 182    3133      ?     13    .85    .10      5    .?0    .20
 183    3133    .51     13    .69    .13      3    .33    .33
 184    3133      ?     13    .69    .13    ***    ***    ***





ITEM    NORM GROUP        LINEAR GROUP          STRDPTV GROUP
NUM.       N      p       N      p   S.E.       N      p   S.E.

 185    3133    .52      13    .92    .08      29    .66    .09
 186    3133    .46      13    .39    .14      28    .71    .09
 187    3133    .50      13    .85    .10      49    .65    .07
 188    3133    .68      13    .85    .10       2   1.00      0
 189    3133    .52      13    .62    .14      48    .69    .07
 190    3133    .53      13    .85    .10     ***    ***    ***
 191    3133    .43      13    .92    .08      55    .78    .06
 192    3133    .50      13    .77    .12     ***    ***    ***
 193    3133    .41      13           .14      32    .47    .10
 194    3133    .53      13    .92    .08     ***    ***    ***
 195    3133    .33      13    .77    .12      10    .80    .13
 196    3133    .30      13    .62    .14      55    .40    .07
 197    3133    .34      13    .62    .14      12    .91    .08
 198    3133    .24      13    .77    .12      23    .48    .11
 199    3133    .25      13    .23    .12      52    .42    .07
 200    3133    .22      13    .77    .12      30    .47    .09
 201    3133
 202    3133    .89       9   1.00      0     ***    ***    ***
 203    3133    .90       9    .89    .11     ***    ***    ***
 204    3133    .77       9    .67    .17       3   1.00      0
 205    3133    .88       9   1.00      0       1      0      0
 206    3133    .83       9   1.00      0     ***    ***    ***
 207    3133    .80       9   1.00      0       9    .89    .11
 208    3133    .86       9    .78    .15     ***    ***    ***
 209    3133    .69       9    .78    .15     ***    ***    ***
 210    3133    .74       9    .67    .17     ***    ***    ***
 211    3133    .68       9    .78    .15     ***    ***    ***
 212    3133    .81       9    .78    .15     ***    ***    ***
 213    3133    .76       9   1.00      0     ***    ***    ***
 214    3133    .69       9    .67    .17     ***    ***    ***
 215    3133    .67       9    .89    .11      20    .90    .07
 216    3133    .82       9    .89    .11       3   1.00      0
 217    3133    .71       9    .89    .11     ***    ***    ***
 218    3133    .89       9    .89    .11     ***    ***    ***
 219    3133    .78       9    .67    .17     ***    ***    ***
 220    3133    .83       9   1.00      0     ***    ***    ***
 221    3133    .73       9    .89    .11     ***    ***    ***
 222    3133    .78       9   1.00      0     ***    ***    ***
 223    3133    .74       9    .78    .15     ***    ***    ***
 224    3133    .71       9   1.00      0       2
 225    3133    .80       9   1.00      0       7   1.00      0
 226    3133    .66       9    .89    .11       5    .80
 227    3133    .61       9    .78    .15     ***    ***    ***
 228    3133    .70       9   1.00      0
 229    3133    .61       9    .44    .18     ***    ***    ***
 230    3133    .65       9    .67    .17      13    .62    .14




ITEM 
NUM. 


NORI^ G 


HClfp 


LINEAR GROUP 


STRDPrv GsrTnp 


N 


' P 


N 


P 


S.E. 


N 


P 


S.E. 


231 
232 

233 
234 

235 
236 

^ 237 
238 
239 
240 
241 
242 
. 243 
244 
245 
246 
247 
248 
249 
250 


3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 

3133 . 

3133 

3133 

3133 

J^JJ 

3133 
3133 
3133 


.58 
.52 
.41 

.57 
.48 
.58 
.54 
.43 
•,52 
.42 
.61 
.44 
.34 
.28 
.45 
.37 

.42 
.23 


9 
9 

9 
9 
9 
9 
9 
9 
9 
9 
9 
9 
9 
9 
9 
9 
9 
9 


.67 
.78 
.56 
.78 
.78 

.89 
.78 
.78 
.78 
.67 
.67 
.67 
.67 
56 
.67 
.33 
.56 
.56 


.17 

.15 
.18 

.15 
.15 
.11' 
, .15 
.15 
.15 
.17 
.17 
.17 
.17 
.18 

.17 
.17 

.18 
.18 


9 
16 

a8 

5 

43 
««♦ 

23 
4 

15 
7 
28 

23 
37 

J J 

53 
11 

33 
1 

17 


.89 
.50 
.50 
.60 
.86 
««« 

.91 
1.00 

.53 
■ .86 
.82 

.57 
.70 

< < 

.72 

.73 
.58 

0 

.35 


.11 

.13 
.12 
.25 

.05 
««« 

. .06 

0 

.n 

.14 

.07 
.11 

.OP 
• < V 

.06 

.14 
.09 

0 

.12 



This item not presented to stradaptive subject.
This item removed from stradaptive pool.
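
The S.E. columns of the item table are not defined in the text. The legible rows appear consistent with the standard error of a proportion computed with an N - 1 denominator; for example, N = 13 with p = .92 is tabled as .08, and N = 9 with p = .78 as .15. The sketch below assumes that formula. The function name is illustrative and is not taken from the study's program.

```python
import math


def proportion_se(n_answered: int, p_correct: float) -> float:
    """Standard error of a proportion correct.

    Assumption: S.E. = sqrt(p * (1 - p) / (N - 1)). This is inferred from
    the tabled values, not stated in the report.
    """
    if n_answered < 2:
        # Rows with a single respondent are tabled with S.E. = 0.
        return 0.0
    return math.sqrt(p_correct * (1.0 - p_correct) / (n_answered - 1))


if __name__ == "__main__":
    for n, p in [(13, 0.92), (9, 0.78), (5, 0.80)]:
        print(n, p, f"{proportion_se(n, p):.2f}")
```

Because the tabled p values are themselves rounded to two places, a recomputed S.E. can differ from the printed one in the last digit.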

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APPENDIX B: TRANSFORMATION OF TRADITIONAL ITEM 
DIFFICULTY (P_g) AND BISERIAL CORRELATION (r_g)
TO NORMAL OGIVE PARAMETERS b_g AND a_g
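
The appendix tabulates the transformed values without restating the formulas. The legible rows are consistent with the conventional normal ogive conversion, b_g = z(1 - P_g) / r_g and a_g = r_g / sqrt(1 - r_g^2), where z is the inverse of the standard normal distribution function. The sketch below assumes that conversion; the function name is illustrative and is not taken from the study's program.

```python
from math import sqrt
from statistics import NormalDist


def normal_ogive_params(p_g: float, r_g: float) -> tuple[float, float]:
    """Convert item difficulty P_g and biserial r_g to (b_g, a_g).

    Assumption: the conventional normal ogive relationships
        b_g = z(1 - P_g) / r_g
        a_g = r_g / sqrt(1 - r_g**2)
    """
    if not 0.0 < p_g < 1.0:
        raise ValueError("P_g must lie strictly between 0 and 1")
    if not 0.0 < abs(r_g) < 1.0:
        raise ValueError("r_g must be nonzero and less than 1 in magnitude")
    gamma_g = NormalDist().inv_cdf(1.0 - p_g)  # normal deviate cutting off P_g
    b_g = gamma_g / r_g
    a_g = r_g / sqrt(1.0 - r_g ** 2)
    return b_g, a_g


if __name__ == "__main__":
    # Item 52 is tabled as P_g = .82, b_g = -1.50, r_g = .61, a_g = .77.
    b, a = normal_ogive_params(0.82, 0.61)
    print(f"b_g = {b:.2f}, a_g = {a:.2f}")
```

Under this conversion an item with a low biserial correlation receives an extreme difficulty parameter, which is one plausible reason an item with too low a discrimination index would be excluded from the pool.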
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Item Number 



1 


• .ftp 


2 


.ftp 




ftft 




ftQ 




ftft 




• f f 


7 

f 


o c 


O 

o 


• oo 


Q 


ft1 




OQ 


11 


o c 
' •fD 


12 


* •Of 


13 


• o^ 


14 


O Q 
• fJ 




Ol 


16 




1 7 




1 P 


m OO 


1 O 


• D^ 


icU 


• OO 








• 


p 


• JO 






P < 


• ^7 


26 


.66 


27 


.54 


28 


.56 


29 


.64 


30 


.44 


^ 31 


.58 


32 


.48 


33 


•49 


34 


.47 


35 


.42 


36 


.3^ 


37 


.51 


38 


.38 


39 


.36 


40 


.48 





. 01 




.oo 




.47 




• 54 


1 O/^ 


' .69 


•1.91 


• 62 


• •95 


• 71 


ft^ 

• OO 




-x»91 


•46 


-1.55 


• 52 


-1 • U5 




. 7< 
- • r 5 


• 59 


■•1 • lo 


• IZ 


•X • vy 


c^ 
• 5o 




• O J 


- .64 


cA 
• 50 


• • X\J 


CO 

• 5Z 


. 7*^ 


• o*f 


- 1ft 
• xo 


• DO 


• •55 


• *l'0 


- • Jx 


OO 

• 32 


• xO 


• 0*1' 




c n 

• 50 




• *f*l' 


- . «;4 


Zip 


- .83 


- .50 


- .16 


.63 


- .24 


.62 


- .85 


.42 


.32 


.47 


- .35 


.57 


.09 


.58 


.05 


.54 


.12 


.61 


.^3 


.47 


.88 


.47 


- .04 


.66 


.62 


.49 


.66 


.54 


.11 


.46 



• 77 
.88 

.53 
.64 

.95 
.79 
1.01 

.55 
.52 
.61 

.83 
.73 
.46 
.6R 
.fll 
.6P 
.61 

.83 
.68 
.52 
.3^ 
.83 
.58 

, .^9 
.46 

.58 
.81 
.79 
.46 
.53 
.69 
.71 
.64 

.77 
.53 
.53 
.88 
.56 
.64 
.52 



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Item Number   P_g      b_g    r_g     a_g

      41       .31     1.60    .31     .33
      42       .49      .06    .45     .50
      43       .41      .95    .24     .25
      44       .34      .79    .52     .61
      45       .36      .65    .55     .66
      46       .31     1.27    .39     .42
      47       .45      .29    .43
      48       .14     4.71    .23     .24
      49       .30     1.34    .39     .42
      50       .24     2.94    .24     .25
      51       .86    -1.69    .64     .83
      52       .82    -1.50    .61     .77
      53       .84    -1.51    .66     .88
      54       .73    -1.02    .60     .75
      55       .83    -1.36    .70     .98
      56       .80    -1.40    .60     .75
      57       .84    -1.78    .56     .68
      58       .82    -1.79    .51     .59
      59       .65     -.62    .62     .79
      60       .60     -.42    .60     .75
      61       .72    -1.17    .50     .58
      62       .67     -.75    .59     .73
      63       .65     -.62    .62     .79
      64       .66    -1.06    .39     .42
      65       .67     -.90    .49     .56
      66       .79    -2.07    .39     .42
      67       .67     -.60    .73    1.07
      68       .71     -.91    .61     .77
      69       .59     -.37    .62     .79
      70       .70     -.95    .55     .66
      71       .60     -.44    .57     .69
      72       .58     -.53    .38     .41
      73       .53     -.17    .45     .50
      74       .71    -1.35    .41     .45
      75       .32      .75    .62     .79
      76       .51     -.06    .45     .50
      77       .39      .67    .42     .46
      78       .44      .28    .54     .64
      79       .60     -.49    .52     .61
      80       .43      .30    .58     .71
      81       .47      .12    .63     .81
      82       .48      .13    .40     .44
      83       .50     0.00    .38     .41
      84       .43      .38    .46     .52
      85       .53     -.18    .43     .48
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Item Number   P_g      b_g    r_g     a_g

      86       .37      .56    .59     .73
      87                .76    .54     .64
      88       .42      .37    .55     .66
      89       .48      .10    .48     .55
      90       .46      .20    .51     .59
      91                .88    .44     .49
      92       .41      .63    .36     .39
      93       .33     1.57    .28     .29
      94       .32     1.06    .44     .49
      95       .37      .81    .41
      96       .22     1.76    .44     .49
      97       .20     1.83    .46     .52
      98       .31     1.91    .26     .27
      99       .29     2.13    .26     .27
     100       .10     2.91    .44     .49
     101       .91    -5.16    .26     .27
     102       .86             .33     .35
     103       .63     -.77    .43     .48
     104       .81    -2.31    .38     .41
     105       .82    -2.03    .45     .50
     106       .85             .29     .30
     107       .82    -2.12    .43     .48
     108       .77    -2.17    .34     .36
     109       .80    -1.22    .69     .95
     110       .18     2.29    .40     .44
     111       .70    -1.31    .40     .44
     112       .48      .13    .40     .44
     113       .47      .24    .32     .34
     114       .68    -1.38    .34     .36
     115       .71             .34     .36
     116       .63     -.61    .54     .64
     117       .65     -.88    .44     .49
     118       .62     -.60    .51     .59
     119       .64     -.80    .45     .50
     120       .58     -.65    .31     .33
     121       .62     -.90    .34     .36
     122       .57     -.33    .54     .64
     123       .57     -.57    .31     .33
     124       .56     -.49    .31     .33
     125       .47      .14    .55     .66
     126       .55     -.33    .38     .41
     127       .50     0.00    .36     .39
     128       .53     -.19    .40     .44
     129       .39      .87    .32     .34
     130       .46      .31    .32     .34
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Item Niomber 











CO 


— in 

• •J.U 


.4R 








kq 




• *f 0 




• ^ f 


. «?3 


• 39 


<o 
• 5V 


k7 


• J J 




• J** 






• ^7 


• J- J 
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- 91 

J- 


• 


.39 


CO 


• • J. J. 








1 Oil 


. 31 


-33 


•39 


Cf\ 


• 






!• 01 




. 


• 35 




Ik 


3^ 


.44 


.29 


.52 


• r) X 


• 32 


1.34 


.35 


37 


.31 


1.03 


.48 






1.33 


.31 


.33 


.36 


1.23 


*29 


.30 


.26 


1.61 


.40 


.44 


.16 


3.68 


.27 


,28 


INFORMATION NOT AVAILABLE — 


NOT USED 


.89 


-2.31 


.53 


.63 


.80 


-1.56 


.54 


.64 


.89 


-1.83 


.67 


.90 


.85 


-1.52 


.68 


• "v 


.84 


-1.74 


.57 




.88 


-2.03 


.58 


71 
• r X 


.77 


-1.27 


.58 


71 

• r 1 


.83 


-l.fl7 


.51 


<Q 
• 7>7 


.82 


-2 ..03 


.45 




.83 


-1.65 


.58 


71 


.77 


-1.30 


.57 




.79 


-1.97 


.41 




.70 


- .90 


.58 


• fi 


.80 


-1.68 


.50 


CP 

j . •)'^ 


.65 


- .76 


.51 


CQ 


.74 


-1.46 


.44 


ko 


.75 


-1.65 


.41 


k«; 


.70 


-1.34 


.39 


•^2 


DISCRIMINATION INDEX TOO 


LOW . 


NOT USED 


.72 


-1.04 


.56 




.64 


- .65 


.55 


.66 


.65 


- .74 


.52 


.61 


.64 


- .59 


.61 


.77 


.61 


-1.07 


.26 


.27 


.66 


- .92 


.45 


.50 



131 
132 

133 
134 

135 
136 

137 
138 
139 

140 
141 
142 
143 
144 
145 
146 
147 
148 
149 
150 
151 
152 
153 
154 
155 
156 
157 
158 
159 
160 
161 
162 

163 
164 

165 
166 

167 
168 
169 
170 
171 
172 

173 
174 
175 
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Item Number   P_g      b_g    r_g     a_g

     176       .65     -.80            .55
     177       .64     -.54    .66     .88
     178       .59     -.41    .55     .66
     179       .60     -.51    .50     .58
     180       .52     -.16    .31     .33
     181       .55     -.31            .45
     182       .50     0.00    .58     .71
     183       .51     -.05    .49     .56
     184       .49      .07    .34     .36
     185       .52     -.11    .47     .53
     186       .46      .29    .35     .37
     187       .50     0.00    .40     .44
     188       .68     -.84    .56     .68
     189       .52     -.19            .28
     190       .53                     .66
     191       .43      .38    .47     .53
     192       .50     0.00    .41
     193       .41      .91    .25     .26
     194       .53     -.18            .48
     195       .33      .94    .47     .53
     196       .30     1.25    .42     .46
     197       .34      .75    .55     .66
     198       .24     1.91    .37     .40
     199       .25     1.69            .44
     200       .22     1.27    .61     .77
     201       .90    -4.13    .31     .33
     202       .89    -1.95    .63     .81
     203       .90    -1.97    .65     .86
     204       .77    -1.39    .53     .63
     205       .88    -1.90    .62     .79
     206       .83    -1.47    .65     .86
     207       .80    -1.87    .45     .50
     208       .86    -1.90    .57     .69
     209       .69    -1.20    .41
     210       .74    -1.37    .47     .53
     211       .68    -1.95    .24
     212       .81    -1.83    .48     .55
     213       .76    -1.31            .64
     214       .69    -1.08    .46     .52
     215       .67     -.72    .61     .77
     216       .82    -2.08    .44     .49
     217       .71     -.91    .61     .77
     218       .89    -2.36    .52     .61
     219       .78    -2.03    .38     .41
     220       .83    -1.95    .49     .56
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Item Number   P_g      b_g    r_g     a_g

     221       .73     -.97    .63     .81
     222       .78    -1.46    .53     .63
     223       .74    -1.69    .38     .41
     224       .71     -.86    .64     .83
     225       .80    -1.11    .76    1.17
     226       .66     -.69    .60     .75
     227       .61     -.76    .37     .40
     228       .70     -.83    .63     .81
     229       .61     -.53    .53     .63
     230       .65     -.65    .59     .73
     231       .58     -.42    .48     .55
     232       .52     -.11    .46     .52
     233       .41      .56    .41     .45
     234       .57     -.35    .51     .59
     235       .48      .09    .57     .69
     236       .58     -.44    .46     .52
     237       .54     -.16    .64     .83
     238       .43      .30    .59     .73
     239       .52     -.08    .61     .77
     240       .42      .40    .51     .59
     241       .61     -.49    .57     .69
     242       .44      .27    .56     .68
     243       .34      .71    .58     .71
     244       .28     1.19    .49     .56
     245       .45      .25    .51     .59
     246       .37      .79    .42     .46
     247       .40      .53    .48     .55
     248       .42      .56    .36     .39
     249       .23     1.94    .38     .41
     250       .14     4.71    .23     .24
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APPENDIX C: FORM LETTER
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July 28, 1974

Dear Orientation Participant,

This note is a request for your help. I am a graduate
student at FSU, working on a research project. I desperately
need participants to volunteer to help me.

If you are willing to help, I will need from 30 to 45
minutes of your time sometime during the three-day orientation
program. You will operate an electronic computer terminal for
this study. The experience should be interesting and informative
for you, and may simplify your computer usage while a
student here at Florida State.

If you are interested in learning more about this project,
please meet with me at Moore Auditorium (in the Union Complex)
at 9:30 A.M. on Monday, the 29th. I will explain all about
the project and answer any questions that you may have.

Thanks again,

Brian Waters
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APPENDIX D: DESCRIPTION OF DATA
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Description of Data Stored on Each Testee's Data File
Data is stored in 10-character words.

Word No.   Data

1          Identification number or Social Security number.
2          Keyword as entered by proctor.
3          Current location in program:
               0-1000      Instructions
               1001-2000   Test 1
               2001-3000   Test 2, if given
               3001-4000   Post-feedback
4          Elapsed time in seconds from time subject began
           instructions until testing was completed.
5          Total time in seconds spent on instructions.
6          Total time in seconds spent on test 1.
7          Number of errors on instructional screens 1-10.
           1 character per screen.
8          Number of errors on instructional screens 11-20.
           1 character per screen.
9          Number of items correct on test 1.
10-12      Testee's name, 30 characters.
13         Characters 1-2: subject's estimated ability, if taken.
           Characters 3-8: blank
           Characters 9-10: college code (01-27)
14         Social Security number, if available.
15         Date of testing.
16         Seconds since midnight when testing began.
17         Elapsed time in seconds spent on test 2.
18         Maximum number of questions which could be
           given on test 1.
19         Maximum number of questions which could be
           given on test 2.
20         Number of items attempted on test 1.
21         Number of items attempted on test 2.
22         First score on test 1.
23         First score on test 2.
24         (reserved for program for recovery information)
25         Number of items correct on test 2.
26         Second score on test 1.
27         Second score on test 2.
28-30      (reserved for program for recovery information)
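
Word 3 encodes the testee's position in the session as a number whose range identifies the phase. A minimal sketch of that lookup follows, assuming the location has already been read as an integer. The function name and the phase labels are illustrative and do not come from the original program.

```python
def program_phase(location: int) -> str:
    """Map the word-3 location code to the session phase it falls in."""
    # Ranges follow the file description: each phase owns a block of 1000 codes.
    if 0 <= location <= 1000:
        return "instructions"
    if 1001 <= location <= 2000:
        return "test 1"
    if 2001 <= location <= 3000:
        return "test 2"
    if 3001 <= location <= 4000:
        return "post-feedback"
    raise ValueError(f"location code out of range: {location}")


if __name__ == "__main__":
    for code in (0, 1500, 2001, 4000):
        print(code, program_phase(code))
```

Storing the position this way would let an interrupted session be resumed from the recorded phase, which fits the words reserved for recovery information.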
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Data on each vocabulary item is packed into one word 
as follows: 



Character(s)   Contents

1              Response code:
                   Code   Meaning
                   0      Item answered incorrectly
                   1      Item answered correctly
                   2      Item answered with a ?
2              Actual response (1-5, ?=0)
3-6            Reference number of item presented
7              Number of presentations of screen
8-10           Response latency in seconds
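
The packed layout can be unpacked by slicing the word at the character positions listed. The sketch below assumes each word is available as a string of ten decimal digits; the class, field, and function names are hypothetical and are not taken from the original program.

```python
from dataclasses import dataclass


@dataclass
class ItemRecord:
    response_code: int   # 0 = incorrect, 1 = correct, 2 = answered with "?"
    response: int        # alternative chosen, 1-5; 0 stands for "?"
    item_ref: int        # reference number of the item presented
    presentations: int   # number of times the screen was presented
    latency_s: int       # response latency in seconds


def parse_item_word(word: str) -> ItemRecord:
    """Split one 10-character item word into its five fields."""
    if len(word) != 10 or not word.isdigit():
        raise ValueError("expected a word of exactly 10 digits")
    return ItemRecord(
        response_code=int(word[0]),     # character 1
        response=int(word[1]),          # character 2
        item_ref=int(word[2:6]),        # characters 3-6
        presentations=int(word[6]),     # character 7
        latency_s=int(word[7:10]),      # characters 8-10
    )


if __name__ == "__main__":
    # Hypothetical word: item 142 answered correctly with option 3,
    # screen shown once, 17 seconds latency.
    print(parse_item_word("1301421017"))
```

With three characters for latency, a single response time is capped at 999 seconds under this layout.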
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